In our final group assignment we will analyse data about Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an Airbnb in a city.
# The skim function gives a structured overview of the data. The variables price and property_type have no missing values. The mean numbers of beds and bedrooms are both close to 2 (1.91 and 1.66), while bathrooms has no statistics because all of its values are missing. The mean minimum stay is about 6.6 nights (median 2), and listings accommodate about 3 guests on average.
skim(listings)
| Name | listings |
|---|---|
| Number of rows | 31030 |
| Number of columns | 74 |
| Column type frequency: | |
| character | 24 |
| Date | 5 |
| logical | 8 |
| numeric | 37 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 31030 | 0 |
| name | 11 | 1.00 | 1 | 250 | 0 | 30309 | 0 |
| description | 1179 | 0.96 | 1 | 1000 | 0 | 29256 | 0 |
| neighborhood_overview | 12691 | 0.59 | 1 | 1000 | 0 | 16565 | 0 |
| picture_url | 0 | 1.00 | 61 | 126 | 0 | 30547 | 0 |
| host_url | 0 | 1.00 | 39 | 43 | 0 | 23467 | 0 |
| host_name | 13 | 1.00 | 1 | 35 | 0 | 7400 | 0 |
| host_location | 45 | 1.00 | 2 | 255 | 0 | 1752 | 0 |
| host_about | 15101 | 0.51 | 1 | 9009 | 0 | 10796 | 22 |
| host_response_time | 13 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 13 | 1.00 | 2 | 4 | 0 | 41 | 0 |
| host_acceptance_rate | 13 | 1.00 | 2 | 4 | 0 | 95 | 0 |
| host_thumbnail_url | 13 | 1.00 | 55 | 106 | 0 | 23334 | 0 |
| host_picture_url | 13 | 1.00 | 57 | 109 | 0 | 23334 | 0 |
| host_neighbourhood | 12457 | 0.60 | 4 | 30 | 0 | 233 | 0 |
| host_verifications | 0 | 1.00 | 2 | 161 | 0 | 443 | 0 |
| neighbourhood | 12690 | 0.59 | 9 | 60 | 0 | 699 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 4 | 16 | 0 | 38 | 0 |
| property_type | 0 | 1.00 | 3 | 35 | 0 | 94 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 34 | 1.00 | 6 | 17 | 0 | 34 | 0 |
| amenities | 0 | 1.00 | 2 | 1520 | 0 | 28475 | 0 |
| price | 0 | 1.00 | 5 | 10 | 0 | 1002 | 0 |
| license | 29649 | 0.04 | 3 | 20 | 0 | 833 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2021-09-08 | 2021-09-09 | 2021-09-09 | 2 |
| host_since | 13 | 1.00 | 2009-03-20 | 2021-09-02 | 2015-12-30 | 3578 |
| calendar_last_scraped | 0 | 1.00 | 2021-09-08 | 2021-09-09 | 2021-09-09 | 2 |
| first_review | 9148 | 0.71 | 2011-03-09 | 2021-09-07 | 2018-09-21 | 2703 |
| last_review | 9148 | 0.71 | 2011-11-16 | 2021-09-09 | 2019-07-29 | 2394 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 13 | 1 | 0.12 | FAL: 27336, TRU: 3681 |
| host_has_profile_pic | 13 | 1 | 1.00 | TRU: 30876, FAL: 141 |
| host_identity_verified | 13 | 1 | 0.75 | TRU: 23184, FAL: 7833 |
| neighbourhood_group_cleansed | 31030 | 0 | NaN | : |
| bathrooms | 31030 | 0 | NaN | : |
| calendar_updated | 31030 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.99 | TRU: 30640, FAL: 390 |
| instant_bookable | 0 | 1 | 0.36 | FAL: 19714, TRU: 11316 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.577473e+07 | 13670837.08 | 1.115600e+04 | 1.500225e+07 | 2.412513e+07 | 3.826905e+07 | 5.211617e+07 | ▆▆▇▆▅ |
| scrape_id | 0 | 1.00 | 2.021091e+13 | 0.00 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 9.613615e+07 | 99177732.26 | 1.085700e+04 | 1.982952e+07 | 5.237223e+07 | 1.524548e+08 | 4.212017e+08 | ▇▂▁▁▁ |
| host_listings_count | 13 | 1.00 | 1.508000e+01 | 158.48 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.508000e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 13 | 1.00 | 1.508000e+01 | 158.48 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.508000e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | -3.386000e+01 | 0.07 | -3.414000e+01 | -3.390000e+01 | -3.388000e+01 | -3.383000e+01 | -3.340000e+01 | ▁▇▃▁▁ |
| longitude | 0 | 1.00 | 1.512000e+02 | 0.09 | 1.506000e+02 | 1.511800e+02 | 1.512200e+02 | 1.512600e+02 | 1.513400e+02 | ▁▁▁▃▇ |
| accommodates | 0 | 1.00 | 3.240000e+00 | 2.12 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▁▁▁▁ |
| bedrooms | 1998 | 0.94 | 1.660000e+00 | 1.05 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.600000e+01 | ▇▁▁▁▁ |
| beds | 395 | 0.99 | 1.910000e+00 | 1.48 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.700000e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 6.560000e+00 | 31.66 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 5.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 6.569800e+02 | 528.24 | 1.000000e+00 | 3.000000e+01 | 1.125000e+03 | 1.125000e+03 | 1.500000e+03 | ▆▁▁▇▁ |
| minimum_minimum_nights | 0 | 1.00 | 6.400000e+00 | 31.04 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 5.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 0 | 1.00 | 6.980000e+00 | 31.48 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 0 | 1.00 | 7.620090e+05 | 40426410.17 | 1.000000e+00 | 3.200000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| maximum_maximum_nights | 0 | 1.00 | 1.869321e+06 | 63319675.46 | 1.000000e+00 | 3.500000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 0 | 1.00 | 6.680000e+00 | 31.22 | 1.000000e+00 | 1.300000e+00 | 3.000000e+00 | 5.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 0 | 1.00 | 1.866813e+06 | 63234821.62 | 1.000000e+00 | 3.500000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 8.350000e+00 | 12.52 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.300000e+01 | 3.000000e+01 | ▇▁▁▁▃ |
| availability_60 | 0 | 1.00 | 1.794000e+01 | 25.68 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 5.300000e+01 | 6.000000e+01 | ▇▁▁▁▃ |
| availability_90 | 0 | 1.00 | 2.791000e+01 | 38.94 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 8.000000e+01 | 9.000000e+01 | ▇▁▁▁▃ |
| availability_365 | 0 | 1.00 | 8.871000e+01 | 130.94 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.700000e+02 | 3.650000e+02 | ▇▁▁▁▂ |
| number_of_reviews | 0 | 1.00 | 1.479000e+01 | 38.25 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 1.000000e+01 | 8.360000e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 2.250000e+00 | 8.84 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.690000e+02 | ▇▁▁▁▁ |
| number_of_reviews_l30d | 0 | 1.00 | 4.000000e-02 | 0.31 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.400000e+01 | ▇▁▁▁▁ |
| review_scores_rating | 9148 | 0.71 | 4.410000e+00 | 1.17 | 0.000000e+00 | 4.500000e+00 | 4.820000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_accuracy | 10324 | 0.67 | 4.740000e+00 | 0.53 | 0.000000e+00 | 4.700000e+00 | 4.930000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_cleanliness | 10308 | 0.67 | 4.580000e+00 | 0.65 | 0.000000e+00 | 4.500000e+00 | 4.810000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_checkin | 10336 | 0.67 | 4.830000e+00 | 0.45 | 0.000000e+00 | 4.850000e+00 | 5.000000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_communication | 10314 | 0.67 | 4.830000e+00 | 0.47 | 0.000000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_location | 10335 | 0.67 | 4.820000e+00 | 0.40 | 0.000000e+00 | 4.800000e+00 | 4.980000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_value | 10343 | 0.67 | 4.640000e+00 | 0.55 | 0.000000e+00 | 4.500000e+00 | 4.800000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 6.360000e+00 | 21.96 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.990000e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 5.120000e+00 | 21.20 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.990000e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 1.130000e+00 | 5.72 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 9.300000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 6.000000e-02 | 0.59 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.700000e+01 | ▇▁▁▁▁ |
| reviews_per_month | 9148 | 0.71 | 6.400000e-01 | 1.31 | 1.000000e-02 | 5.000000e-02 | 1.500000e-01 | 6.800000e-01 | 5.400000e+01 | ▇▁▁▁▁ |
favstats(~number_of_reviews, data = listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 10 | 836 | 14.8 | 38.2 | 31030 | 0 |
# On average there are about 15 reviews per listing, but the distribution is strongly right-skewed: the median is only 2 and the maximum is 836.
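The gap between mean and median is a quick check for skewness. A minimal base-R sketch on made-up review counts (the vector below is invented for illustration and is not taken from the data):

```r
# Hypothetical review counts: most listings have few reviews, one has very many
reviews <- c(0, 0, 1, 2, 2, 3, 10, 100)

mean_reviews   <- mean(reviews)    # pulled upwards by the single large value
median_reviews <- median(reviews)  # barely affected by the large value

# A mean well above the median signals a right-skewed distribution
right_skewed <- mean_reviews > median_reviews

print(c(mean = mean_reviews, median = median_reviews))
print(right_skewed)
```

For variables like this, the median is usually the more representative summary of a typical listing.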
favstats(~reviews_per_month, data = listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0.01 | 0.05 | 0.15 | 0.68 | 54 | 0.64 | 1.31 | 21882 | 9148 |
# Listings receive about 0.64 reviews per month on average (median 0.15), with values ranging from 0.01 to 54. The variable is missing for 9148 listings.
favstats(~review_scores_rating, data = listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.5 | 4.82 | 5 | 5 | 4.41 | 1.17 | 21882 | 9148 |
# This statistic is very informative: ratings are concentrated near the top of the 0 to 5 scale, with a median of 4.82 and a mean of 4.41, computed from the 21882 listings that have a rating.
favstats(~bathrooms, data = listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| | | | | | NaN | | 0 | 31030 |
# Note that no statistics are available for bathrooms because all 31030 values are missing.
favstats(~review_scores_cleanliness, data = listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.5 | 4.81 | 5 | 5 | 4.58 | 0.653 | 20722 | 10308 |
# As with the overall rating, cleanliness scores are high: the median is 4.81 and the mean 4.58 on the 0 to 5 scale, based on 20722 listings.
favstats(~review_scores_communication, data = listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.86 | 5 | 5 | 5 | 4.83 | 0.471 | 20716 | 10314 |
# Communication scores are also high (median 5, mean 4.83), based on 20716 listings.
favstats(~review_scores_checkin, data = listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.85 | 5 | 5 | 5 | 4.83 | 0.452 | 20694 | 10336 |
favstats(~review_scores_location, data = listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.8 | 4.98 | 5 | 5 | 4.82 | 0.399 | 20695 | 10335 |
favstats(~maximum_nights, data = listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 30 | 1125 | 1125 | 1500 | 657 | 528 | 31030 | 0 |
favstats(~minimum_nights, data = listings)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2 | 5 | 1125 | 6.56 | 31.7 | 31030 | 0 |
# Neither variable has missing data (31030 observations each). The standard deviations are very large (528 for maximum_nights and 31.7 for minimum_nights), which points to extreme values.
ggplot(data = listings, aes(x = review_scores_rating, y = review_scores_cleanliness)) +
  geom_point() +
  ggtitle("Relationship between ratings and cleanliness scores") +
  xlab("Ratings") + ylab("Cleanliness")
# We plotted ratings against cleanliness scores to see how perceived cleanliness relates to the overall rating.
ggplot(data = listings, aes(x = host_identity_verified)) +
  geom_bar(color = "black", fill = "white") +
  ggtitle("Number of listings by host verification status") +
  xlab("Host identity verified") + ylab("Number of Listings")
# We counted listings by host verification status to see how common verified hosts are on Airbnb.
ggplot(data = listings, aes(x = review_scores_location)) +
  geom_histogram(bins = 30) + # set the bins here; a separate stat_bin() layer would draw the histogram twice
  ggtitle("Location ratings") +
  xlab("Ratings") + ylab("Number of Listings")
# We also considered the location scores of listings across Sydney.
listings <- listings %>%
  mutate(price = parse_number(as.character(price))) # convert price from character to double
typeof(listings$price)
[1] "double"
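To illustrate what this conversion does, here is a base-R sketch on made-up price strings. The report itself uses parse_number from the readr package; the gsub approach below is only a stand-in for illustration, and the example strings (including the assumption that prices look like "$1,250.00") are invented:

```r
# Made-up examples of prices stored as text (format is an assumption)
price_chr <- c("$85.00", "$1,250.00", "$40.00")

# Remove everything except digits and the decimal point, then convert to numeric
price_num <- as.numeric(gsub("[^0-9.]", "", price_chr))

print(price_num)
print(typeof(price_num))
```

Stripping the currency symbol and thousands separator first matters because as.numeric("$1,250.00") on its own would return NA.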
listings_num_variables <- listings %>%
  select(where(is.numeric))
# Drop identifiers, coordinates, price and variables that duplicate other columns.
# Names are matched exactly by %in%; availability_60 is kept.
dropping <- c("host_total_listings_count", "price", "scrape_id", "host_id", "latitude", "longitude",
              "availability_30", "availability_90", "availability_365",
              "calculated_host_listings_count", "calculated_host_listings_count_entire_homes",
              "calculated_host_listings_count_private_rooms", "calculated_host_listings_count_shared_rooms",
              "minimum_minimum_nights", "maximum_minimum_nights", "minimum_maximum_nights", "maximum_maximum_nights",
              "minimum_nights_avg_ntm", "maximum_nights_avg_ntm")
listings_num_variables <- listings_num_variables[, !(names(listings_num_variables) %in% dropping)] # remove duplicated variables
listings_num_variables <- listings_num_variables %>%
  filter(maximum_nights < 360)
listings_num_variables
library(corrplot)
correl <- cor(listings_num_variables, use = "pairwise.complete.obs")
corrplot(correl, method = "circle") # visualise the correlation matrix
# Drop other, less relevant variables; reviews_per_month is kept.
dropping2 <- c("beds", "number_of_reviews_ltm", "number_of_reviews_l30d", "review_scores_accuracy", "review_scores_checkin", "review_scores_location", "review_scores_value", "host_listings_count")
listings_num_variables <- listings_num_variables[, !(names(listings_num_variables) %in% dropping2)]
listings_num_variables
ggpairs(listings_num_variables) # pairwise plots and correlations (ggpairs() comes from the GGally package)
skim(listings_num_variables)
From the initial dataset we removed repetitive or less useful variables. We first summarised the data using skim on listings to understand the starting point. We then reduced the dataset and examined the correlations shown in the plots. In particular, the relationships between most variables do not appear to be linear.
We started with 74 variables and ended with 11 after the successive drops. We started with 31030 observations and ended with 12314 after filtering the dataset.
The numeric variables at the beginning were: id, scrape_id, host_id, host_listings_count, host_total_listings_count, latitude, longitude, accommodates, bedrooms, beds, minimum_nights, maximum_nights, minimum_minimum_nights, maximum_minimum_nights, minimum_maximum_nights, maximum_maximum_nights, minimum_nights_avg_ntm, maximum_nights_avg_ntm, availability_30, availability_60, availability_90, availability_365, number_of_reviews, number_of_reviews_ltm, number_of_reviews_l30d, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms, reviews_per_month.
After modifying the dataset: id, accommodates, bedrooms, minimum_nights, maximum_nights, availability_60, number_of_reviews, review_scores_rating, review_scores_cleanliness, review_scores_communication, reviews_per_month.
Which are categorical or factor variables (numeric or character variables that have a fixed and known set of possible values)?
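The character variables are: listing_url, name, description, neighborhood_overview, picture_url, host_url, host_name, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_thumbnail_url, host_picture_url, host_neighbourhood, host_verifications, neighbourhood, neighbourhood_cleansed, property_type, room_type, bathrooms_text, amenities, price, license. Of these, only a few have a small, fixed set of values and can be treated as factors, for example room_type (4 unique values), host_response_time (5) and neighbourhood_cleansed (38); price is stored as text but is numeric in nature.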
The correlation plots show that most variables are only weakly linearly related. Outside groups of the same kind (the different review scores, for instance), the correlation coefficients are quite low. A low coefficient only rules out a strong linear relationship; the variables may still be related in a non-linear way.
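Two points about these correlations can be sketched on toy data (all values below are invented for illustration). First, use = "pairwise.complete.obs" computes each coefficient from the rows where both variables are present. Second, a coefficient near zero does not rule out a strong non-linear relationship:

```r
# Toy data with missing values (invented for illustration)
toy <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, 6, NA, 10),
  c = c(5, 3, NA, 1, 0)
)

# Each pair of columns uses only the rows where both values are observed
correl_toy <- cor(toy, use = "pairwise.complete.obs")
print(round(correl_toy, 2))

# A perfectly determined but non-linear relationship: y is x squared
x <- -3:3
y <- x^2
nonlinear_cor <- cor(x, y)
print(nonlinear_cor)
```

In the toy data, a and b are exactly proportional on the three rows they share, while x and y are perfectly related yet have essentially no linear correlation.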
listings <- listings %>%
  mutate(price = parse_number(as.character(price))) # convert price from character to double
typeof(listings$price)
[1] "double"
We used typeof(listings$price) to confirm that price is now stored as a number.
Next, we look at the variable property_type. We can use the count function to determine how many categories there are and their frequency. What are the top 4 most common property types? What proportion of the total listings do they make up?
listings_category <- listings %>% # rank the property types by frequency
  group_by(property_type) %>%
  summarise(number_of_category = n()) %>% # number of listings of each property type
  arrange(desc(number_of_category)) %>% # most common first
  mutate(all_category = sum(number_of_category), # total number of listings
         category_proportion = number_of_category / all_category) %>% # share of all listings
  head(4) # keep the 4 most common property types
listings_category
| property_type | number_of_category | all_category | category_proportion |
|---|---|---|---|
| Entire rental unit | 11678 | 31030 | 0.376 |
| Private room in rental unit | 5870 | 31030 | 0.189 |
| Entire residential home | 4499 | 31030 | 0.145 |
| Private room in residential home | 3740 | 31030 | 0.121 |
listings_category_top_4 <- listings_category %>% # what proportion of all listings do the top 4 make up?
  summarise(top_four_category = sum(category_proportion))
listings_category_top_4
| top_four_category |
|---|
| 0.831 |
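The proportions above can be checked by hand. A short sketch that recomputes them from the counts shown in the table (the counts and the total of 31030 are copied from the output above; the object names are only illustrative):

```r
# Counts of the four most common property types, copied from the table above
counts_top4 <- c(
  "Entire rental unit" = 11678,
  "Private room in rental unit" = 5870,
  "Entire residential home" = 4499,
  "Private room in residential home" = 3740
)
total_listings <- 31030

# Share of all listings for each type, and for the four combined
shares <- counts_top4 / total_listings
top_four_share <- sum(shares)

print(round(shares, 3))
print(round(top_four_share, 3))
```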
Since the vast majority of the observations in the data belong to one of the top four property types, we would like to create a simplified version of the property_type variable that has 5 categories: the top four categories and Other. Fill in the code below to create prop_type_simplified.
Use the code below to check that prop_type_simplified was correctly made.
listings <- listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit", "Private room in rental unit", "Entire residential home", "Private room in residential home") ~ property_type,
    TRUE ~ "Other"
  ))
listings %>% # compare the new column with the original classification
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))
| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 11678 |
| Private room in rental unit | Private room in rental unit | 5870 |
| Entire residential home | Entire residential home | 4499 |
| Private room in residential home | Private room in residential home | 3740 |
| Private room in townhouse | Other | 666 |
| Entire townhouse | Other | 535 |
| Entire guest suite | Other | 523 |
| Entire guesthouse | Other | 382 |
| Entire condominium (condo) | Other | 335 |
| Shared room in rental unit | Other | 332 |
| Room in boutique hotel | Other | 251 |
| Private room in condominium (condo) | Other | 239 |
| Private room in villa | Other | 147 |
| Entire serviced apartment | Other | 135 |
| Private room in guest suite | Other | 130 |
| Room in hotel | Other | 126 |
| Entire loft | Other | 124 |
| Entire cottage | Other | 122 |
| Private room in guesthouse | Other | 119 |
| Entire villa | Other | 118 |
| Shared room in residential home | Other | 112 |
| Entire bungalow | Other | 87 |
| Private room in bed and breakfast | Other | 75 |
| Private room in hostel | Other | 74 |
| Shared room in hostel | Other | 51 |
| Private room in bungalow | Other | 46 |
| Room in aparthotel | Other | 43 |
| Room in serviced apartment | Other | 39 |
| Private room in loft | Other | 34 |
| Entire cabin | Other | 33 |
| Private room in serviced apartment | Other | 33 |
| Entire place | Other | 32 |
| Tiny house | Other | 29 |
| Private room | Other | 22 |
| Shared room in condominium (condo) | Other | 20 |
| Boat | Other | 17 |
| Private room in cabin | Other | 15 |
| Camper/RV | Other | 12 |
| Room in hostel | Other | 12 |
| Shared room in guesthouse | Other | 12 |
| Shared room in townhouse | Other | 12 |
| Shared room in villa | Other | 12 |
| Farm stay | Other | 11 |
| Private room in cottage | Other | 10 |
| Room in bed and breakfast | Other | 9 |
| Private room in tiny house | Other | 8 |
| Shared room in bed and breakfast | Other | 8 |
| Private room in casa particular | Other | 5 |
| Tent | Other | 5 |
| Earth house | Other | 4 |
| Entire chalet | Other | 4 |
| Private room in boat | Other | 4 |
| Private room in earth house | Other | 4 |
| Shared room in guest suite | Other | 4 |
| Barn | Other | 3 |
| Entire home/apt | Other | 3 |
| Floor | Other | 3 |
| Island | Other | 3 |
| Private room in camper/rv | Other | 3 |
| Private room in farm stay | Other | 3 |
| Private room in tent | Other | 3 |
| Casa particular | Other | 2 |
| Holiday park | Other | 2 |
| Private room in barn | Other | 2 |
| Private room in bus | Other | 2 |
| Private room in chalet | Other | 2 |
| Shared room in loft | Other | 2 |
| Shared room in serviced apartment | Other | 2 |
| Bus | Other | 1 |
| Campsite | Other | 1 |
| Castle | Other | 1 |
| Cave | Other | 1 |
| Dome house | Other | 1 |
| Private room in in-law | Other | 1 |
| Private room in island | Other | 1 |
| Private room in minsu | Other | 1 |
| Private room in nature lodge | Other | 1 |
| Private room in pension | Other | 1 |
| Private room in resort | Other | 1 |
| Private room in tipi | Other | 1 |
| Private room in train | Other | 1 |
| Private room in yurt | Other | 1 |
| Room in resort | Other | 1 |
| Shared room in boat | Other | 1 |
| Shared room in boutique hotel | Other | 1 |
| Shared room in cave | Other | 1 |
| Shared room in cottage | Other | 1 |
| Shared room in earth house | Other | 1 |
| Shared room in farm stay | Other | 1 |
| Shared room in tent | Other | 1 |
| Shared room in tiny house | Other | 1 |
| Train | Other | 1 |
| Treehouse | Other | 1 |
| Yurt | Other | 1 |
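The same collapsing logic can be sketched in base R on a handful of values. The small vector below is invented for illustration; the four top categories are the ones from the table above:

```r
# A few example property types (sample values chosen for illustration)
property_type <- c("Entire rental unit", "Castle", "Private room in rental unit",
                   "Boat", "Entire residential home")

# The four most common categories found earlier
top_types <- c("Entire rental unit", "Private room in rental unit",
               "Entire residential home", "Private room in residential home")

# Keep the top categories as they are and label everything else "Other"
prop_type_simplified <- ifelse(property_type %in% top_types, property_type, "Other")

print(table(prop_type_simplified))
```

Collapsing the 90 rare categories into one level keeps the later regression from estimating many coefficients on only a few observations each.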
Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:
minimum_nights?
listings %>% # counting each value shows which ones are most common
  count(minimum_nights) %>%
  arrange(desc(n))
| minimum_nights | n |
|---|---|
| 1 | 8203 |
| 2 | 7320 |
| 3 | 4468 |
| 7 | 3127 |
| 5 | 2677 |
| 4 | 1774 |
| 14 | 781 |
| 10 | 475 |
| 6 | 439 |
| 30 | 265 |
| 21 | 202 |
| 90 | 163 |
| 28 | 113 |
| 15 | 106 |
| 20 | 96 |
| 8 | 95 |
| 31 | 81 |
| 12 | 72 |
| 60 | 63 |
| 9 | 52 |
| 13 | 39 |
| 365 | 38 |
| 180 | 34 |
| 25 | 27 |
| 100 | 23 |
| 16 | 15 |
| 19 | 14 |
| 11 | 13 |
| 29 | 13 |
| 50 | 13 |
| 18 | 12 |
| 35 | 11 |
| 40 | 11 |
| 120 | 11 |
| 24 | 10 |
| 17 | 9 |
| 23 | 9 |
| 45 | 8 |
| 91 | 8 |
| 360 | 8 |
| 42 | 7 |
| 70 | 7 |
| 27 | 6 |
| 300 | 6 |
| 1e+03 | 6 |
| 56 | 5 |
| 150 | 5 |
| 200 | 5 |
| 1.12e+03 | 5 |
| 22 | 4 |
| 55 | 3 |
| 58 | 3 |
| 80 | 3 |
| 500 | 3 |
| 1.1e+03 | 3 |
| 26 | 2 |
| 32 | 2 |
| 34 | 2 |
| 47 | 2 |
| 48 | 2 |
| 84 | 2 |
| 92 | 2 |
| 183 | 2 |
| 222 | 2 |
| 240 | 2 |
| 364 | 2 |
| 33 | 1 |
| 37 | 1 |
| 44 | 1 |
| 49 | 1 |
| 51 | 1 |
| 62 | 1 |
| 74 | 1 |
| 75 | 1 |
| 83 | 1 |
| 85 | 1 |
| 87 | 1 |
| 89 | 1 |
| 93 | 1 |
| 94 | 1 |
| 95 | 1 |
| 96 | 1 |
| 99 | 1 |
| 115 | 1 |
| 130 | 1 |
| 132 | 1 |
| 149 | 1 |
| 152 | 1 |
| 168 | 1 |
| 178 | 1 |
| 179 | 1 |
| 182 | 1 |
| 185 | 1 |
| 190 | 1 |
| 198 | 1 |
| 199 | 1 |
| 211 | 1 |
| 220 | 1 |
| 256 | 1 |
| 280 | 1 |
| 333 | 1 |
| 395 | 1 |
| 700 | 1 |
| 999 | 1 |
| 1.12e+03 | 1 |
For minimum_nights, a 7-night minimum stay is more common than a 4-, 5- or 6-night minimum, which stands out because it is a longer period. This can be explained: hosts likely prefer to rent out their properties for a week at a time rather than have the rental period end midweek.
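As a quick check on how much data the travel filter keeps, the counts in the table above can be added up. This is only a sketch: the four counts are copied by hand from the table, and the total of 31030 is the number of rows reported by skim.

```r
# Counts for minimum_nights of 1 to 4, copied from the table above
counts <- c(`1` = 8203, `2` = 7320, `3` = 4468, `4` = 1774)

# Total number of listings, as reported by skim(listings)
total_listings <- 31030

# Share of listings that allow a stay of 4 nights or fewer
share <- sum(counts) / total_listings
round(100 * share, 1)
```

Roughly 70% of listings pass the filter, so restricting to minimum_nights <= 4 still leaves most of the data.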
The following code, which assumes the dataframe listings contains all Airbnb listings in Sydney, plots on a map every listing where minimum_nights is less than or equal to four (4).
leaflet(data = filter(listings, minimum_nights <= 4)) %>% #Using leaflet to display a map of the properties in our dataframe with minimum nights of 4 or fewer
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)
For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.
We will create a new variable called price_4_nights that uses price and accommodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
# neighbourhood_group_cleansed is left out of the drop list because it will be used in the analysis part (as Kostis asked us to do); license is also kept, since it may have some impact on the final model
drop_columns <- (c("id", #useless in our analysis
"listing_url", #useless in our analysis
"scrape_id", #useless in our analysis
"last_scraped", #useless in our analysis
"name", #useless in our analysis
"description", #useless in our analysis
"neighborhood_overview", #useless in our analysis
"picture_url", #useless in our analysis
"host_id", #useless in our analysis
"host_url", #useless in our analysis
"host_name", #useless in our analysis
"host_about", #useless in our analysis
"host_thumbnail_url", #useless in our analysis
"host_picture_url", #useless in our analysis
"bathrooms", #contains only NAs
"minimum_minimum_nights", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
"maximum_minimum_nights", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
"minimum_maximum_nights", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
"maximum_maximum_nights", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
"minimum_nights_avg_ntm", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
"maximum_nights_avg_ntm", #inconsistent data, removed following this advice: https://medium.com/@kalenderselmir/munichs-airbnb-data-analysis-fd815f2c918f
"calendar_updated", #contains only NAs
"calendar_last_scraped",
"first_review",
"last_review",
"calendar_updated",
"calculated_host_listings_count",
"calculated_host_listings_count_entire_homes",
"calculated_host_listings_count_private_rooms",
"calculated_host_listings_count_shared_rooms"))
listings_sydney <- listings %>% # creating a new dataframe without the useless columns, keeping our old df intact
select(-all_of(drop_columns)) # all_of() makes it explicit that drop_columns is a vector of column names
bathrooms_list<-unique(as.character(listings_sydney$bathrooms_text))
listings_sydney_2 <- listings_sydney %>% # we withdraw the numbers from the below strings
mutate(bathrooms_number=case_when(bathrooms_text=="1 shared bath"~1,
bathrooms_text=="3 baths"~3,
bathrooms_text=="1 private bath"~1,
bathrooms_text=="1 bath"~1,
bathrooms_text=="1.5 shared baths"~1.5,
bathrooms_text=="2.5 shared baths"~2.5,
bathrooms_text=="2 baths"~2,
bathrooms_text=="1.5 baths"~1.5,
bathrooms_text=="2.5 baths"~2.5,
bathrooms_text=="0 baths"~0,
bathrooms_text=="2 shared baths"~2,
bathrooms_text=="4 baths"~4,
bathrooms_text=="3 shared baths"~3,
bathrooms_text=="Half-bath"~0.5,
bathrooms_text=="Shared half-bath"~0.5,
bathrooms_text=="3.5 baths"~3.5,
bathrooms_text=="3.5 shared baths"~3.5,
bathrooms_text=="5 baths"~5,
bathrooms_text=="4.5 baths"~4.5,
bathrooms_text=="0 shared baths"~0,
bathrooms_text=="6 baths"~6,
bathrooms_text=="5.5 baths"~5.5,
bathrooms_text=="6 shared baths"~6,
bathrooms_text=="Private half-bath"~0.5,
bathrooms_text=="8 baths"~8,
bathrooms_text=="4 shared baths"~4,
bathrooms_text=="7 baths"~7,
bathrooms_text=="6.5 baths"~6.5,
bathrooms_text=="5.5 shared baths"~5.5,
bathrooms_text=="4.5 shared baths"~4.5,
bathrooms_text=="5 shared baths"~5,
bathrooms_text=="14.5 shared baths"~14.5,
bathrooms_text=="7 shared baths"~7,
bathrooms_text=="10 baths"~10))
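Listing every label by hand is easy to get wrong, because a single typo in a label silently produces NA. As an alternative sketch, the number can be parsed out of bathrooms_text with a regular expression. The helper parse_bathrooms is hypothetical and not part of the original analysis; it assumes that every label either starts with a number or contains the word "half-bath".

```r
# Hypothetical helper: extract the number of bathrooms from labels such as
# "1.5 shared baths" or "Half-bath" (assumed label formats).
parse_bathrooms <- function(x) {
  x <- as.character(x)
  # Keep only the first number in the label, allowing a decimal part
  first_number <- sub("^[^0-9]*([0-9]+(\\.[0-9]+)?).*$", "\\1", x)
  # Labels without digits are left unchanged by sub() and become NA here
  num <- suppressWarnings(as.numeric(first_number))
  # "Half-bath", "Shared half-bath" and "Private half-bath" count as 0.5
  num[grepl("half-bath", x, ignore.case = TRUE)] <- 0.5
  num
}

labels <- c("1 shared bath", "1.5 baths", "Half-bath",
            "Shared half-bath", "14.5 shared baths", NA)
parsed <- parse_bathrooms(labels)
parsed
```

Inside the pipeline this could be used as mutate(bathrooms_number = parse_bathrooms(bathrooms_text)), which would also cover labels that are not in the list above.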
#coordinates for Sydney Opera house: latitude -33.8568°, longitude 151.2153°
#formula for the straight-line distance between two coordinates (measured in degrees, not kilometres): sqrt((x1-x2)^2+(y1-y2)^2)
listings_sydney_opera_distance <- listings_sydney_2 %>%
mutate(distance_opera=sqrt((latitude-(-33.8568))^2+(longitude-151.2153)^2))
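The formula above treats one degree of latitude and one degree of longitude as the same length, which is only an approximation: at Sydney's latitude a degree of longitude is about 0.83 times as long as a degree of latitude. A minimal sketch of the haversine formula, which returns a distance in kilometres, is shown below. The function name haversine_km and the Earth radius of 6371 km are assumptions made for illustration, not part of the original analysis.

```r
# Hypothetical helper: great-circle distance in km between two points
# given in decimal degrees, using the haversine formula.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding errors pushing sqrt(a) above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Coordinates of the Sydney Opera House, as used above
opera_lat <- -33.8568
opera_lon <- 151.2153

# Distance from a point 0.1 degrees further east to the Opera House
haversine_km(opera_lat, opera_lon + 0.1, opera_lat, opera_lon)
```

The function is vectorised, so it could replace the sqrt() expression directly, for example mutate(distance_opera = haversine_km(latitude, longitude, -33.8568, 151.2153)).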
listings_sydney_golden <- listings_sydney_opera_distance %>% #we now create our 'Golden' dataframe that we will use for the rest of our analysis
filter(grepl("Sydney",host_location)) %>% #filter for location name to include "Sydney"
filter(accommodates>=2, #to find price for 4 nights for 2 people first we restrict to those properties that can accommodate 2 or more people
minimum_nights<=4) %>% #we can't consider properties that require you to stay for more than 4 nights and hence filter them out
mutate(price_4_nights=price*4) %>% #we now take the price per night for the rooms that satisfy the above and multiply by 4 to get `price_4_nights`
arrange(desc(price_4_nights)) %>% # note that we aren't calculating pro rata and multiplying by two as we are assuming that if 2 people book a property for 4 people, they still have to pay full price
mutate(log_price_4_nights=log(price_4_nights))
We now use histograms and density plots to examine the distributions of price_4_nights and log_price_4_nights. Which variable should you use for the regression model? Why? As the plots show, log_price_4_nights is much closer to a normal distribution, so we use it as the response.
ggplot(listings_sydney_golden,aes(price_4_nights))+ #price_4_nights is a very right skewed data set
geom_density(aes(x=price_4_nights))
ggplot(listings_sydney_golden,aes(price_4_nights))+ #...which leads to a very right skewed histogram
geom_histogram(aes(x=price_4_nights), bins=50)
ggplot(listings_sydney_golden,aes(log_price_4_nights))+ #log_price_4_nights is still right skewed but a lot less so
geom_density(aes(x=log_price_4_nights))
ggplot(listings_sydney_golden,aes(log_price_4_nights))+ #... further highlighted by this histogram showing a close to normal distribution
geom_histogram(aes(x=log_price_4_nights),bins=30)
We will fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
model1 <-lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating, data = listings_sydney_golden)
model1 %>%
glance()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.361 | 0.36 | 0.553 | 465 | 0 | 6 | -4.09e+03 | 8.19e+03 | 8.24e+03 | 1.51e+03 | 4939 | 4946 |
msummary(model1)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.2770556 0.0402906 155.795 < 2e-16 ***
prop_type_simplifiedEntire residential home 0.6901556 0.0262279 26.314 < 2e-16 ***
prop_type_simplifiedOther -0.1652345 0.0228788 -7.222 5.9e-13 ***
prop_type_simplifiedPrivate room in rental unit -0.6890471 0.0234864 -29.338 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -0.7608048 0.0270010 -28.177 < 2e-16 ***
number_of_reviews -0.0003980 0.0001498 -2.657 0.00792 **
review_scores_rating 0.0263210 0.0085037 3.095 0.00198 **
Residual standard error: 0.5534 on 4939 degrees of freedom
(1165 observations deleted due to missingness)
Multiple R-squared: 0.3611, Adjusted R-squared: 0.3603
F-statistic: 465.3 on 6 and 4939 DF, p-value: < 2.2e-16
pairs.panels(listings_sydney_golden[c("prop_type_simplified","number_of_reviews","review_scores_rating")])
autoplot(model1)+theme_bw()
review_scores_rating in terms of log_price_4_nights. The coefficient is statistically significant. Because the response is on the log scale, a one-point increase in rating is associated with an increase of roughly 2.6% in price_4_nights, holding the other variables constant.
prop_type_simplified in terms of log_price_4_nights. The coefficient of each property type is significant. Each category is a dummy variable that takes the value 0 or 1, depending on whether the property is in the category, and its coefficient is measured relative to the omitted baseline category. Generally, renting an “Entire residential home” is associated with a higher price, whilst renting a “Private room in rental unit”, a “Private room in residential home”, or any other type of property is associated with a lower price, as the signs of the coefficients show.
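Since the response is log_price_4_nights, a coefficient b corresponds to a multiplicative change of exp(b) in price_4_nights, that is a percentage change of (exp(b) - 1) * 100. For small coefficients this is close to b * 100, but for the property type dummies the difference is large. The sketch below applies the conversion to three coefficients copied by hand from the model1 output; the short names in the vector are only labels chosen for illustration.

```r
# Coefficients copied from the model1 output above
coefs <- c(review_scores_rating        = 0.0263210,
           entire_residential_home     = 0.6901556,
           private_room_in_rental_unit = -0.6890471)

# Percentage change in price_4_nights for a one-unit increase in the
# predictor (or for switching the dummy from 0 to 1)
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 1)
```

With a fitted model object the same conversion could be applied to all terms at once with (exp(coef(model1)) - 1) * 100.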
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.
model2 <-lm(log_price_4_nights ~ prop_type_simplified+number_of_reviews+review_scores_rating+room_type, data = listings_sydney_golden)
msummary(model2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.3022284 0.0392597 160.527 < 2e-16 ***
prop_type_simplifiedEntire residential home 0.6893621 0.0255180 27.015 < 2e-16 ***
prop_type_simplifiedOther 0.0636344 0.0273906 2.323 0.020208 *
prop_type_simplifiedPrivate room in rental unit -0.0756192 0.0479169 -1.578 0.114600
prop_type_simplifiedPrivate room in residential home -0.1463870 0.0496692 -2.947 0.003221 **
number_of_reviews -0.0005262 0.0001460 -3.605 0.000316 ***
review_scores_rating 0.0216264 0.0082846 2.610 0.009070 **
room_typeHotel room 0.0279596 0.0840393 0.333 0.739378
room_typePrivate room -0.6162552 0.0422157 -14.598 < 2e-16 ***
room_typeShared room -1.0707088 0.1129886 -9.476 < 2e-16 ***
Residual standard error: 0.5384 on 4936 degrees of freedom
(1165 observations deleted due to missingness)
Multiple R-squared: 0.3956, Adjusted R-squared: 0.3945
F-statistic: 359 on 9 and 4936 DF, p-value: < 2.2e-16
pairs.panels(listings_sydney_golden[c("prop_type_simplified","number_of_reviews","review_scores_rating","room_type")])
This model seems worse than the prior one despite an increased adjusted R squared (0.3945 versus 0.3603), since “prop_type_simplifiedPrivate room in rental unit” and “room_typeHotel room” are both insignificant. This likely reflects the overlap between prop_type_simplified and room_type, which describe similar information.
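Individual t-tests only compare one level of room_type with its baseline. To test whether room_type as a whole is a significant predictor given everything else in the model, one option is a partial F-test between the nested models, which anova() provides. The sketch below uses simulated data, not the Airbnb listings: the data frame sim, its effect sizes and the model names small and large are all made up for illustration.

```r
set.seed(1)
n <- 300

# Simulated data (hypothetical): a rating and a three-level room type
sim <- data.frame(
  rating    = rnorm(n, mean = 4.5, sd = 0.3),
  room_type = factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                            n, replace = TRUE))
)

# Assumed effect of each room type on the log price
effect <- c("Entire home/apt" = 0, "Private room" = -0.6, "Shared room" = -1.0)
sim$log_price <- 6 + 0.02 * sim$rating +
  unname(effect[as.character(sim$room_type)]) + rnorm(n, sd = 0.5)

# Nested models: the larger one adds room_type to the smaller one
small <- lm(log_price ~ rating, data = sim)
large <- lm(log_price ~ rating + room_type, data = sim)

# Partial F-test: does adding room_type reduce the residual sum of squares
# by more than chance would explain?
comparison <- anova(small, large)
comparison
```

Applied to this report, the equivalent call would be anova(model1, model2), which is valid here because both models are fitted to the same 4946 observations.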
Our dataset has many more variables.
Are bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights? Or might these be co-linear variables?
#we will do more analysis on new variables, so previous variables should also be included
model_bathrooms <-lm(log_price_4_nights ~ bathrooms_number, data = listings_sydney_golden)
msummary(model_bathrooms)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.56998 0.02076 268.28 <2e-16 ***
bathrooms_number 0.54493 0.01492 36.53 <2e-16 ***
Residual standard error: 0.6503 on 6102 degrees of freedom
(7 observations deleted due to missingness)
Multiple R-squared: 0.1794, Adjusted R-squared: 0.1793
F-statistic: 1334 on 1 and 6102 DF, p-value: < 2.2e-16
It seems as though it is!
#bedrooms
listings_sydney_bedrooms <- listings_sydney_golden
#replacing NA values in bedrooms - using base R as recode is not working (we cannot use this way to change original number)
model_bedrooms <-lm(log_price_4_nights ~ bedrooms, data = listings_sydney_bedrooms)
msummary(model_bedrooms)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.477427 0.015678 349.36 <2e-16 ***
bedrooms 0.512737 0.008669 59.14 <2e-16 ***
Residual standard error: 0.5798 on 5617 degrees of freedom
(492 observations deleted due to missingness)
Multiple R-squared: 0.3838, Adjusted R-squared: 0.3836
F-statistic: 3498 on 1 and 5617 DF, p-value: < 2.2e-16
bedrooms works too!
#beds
listings_sydney_beds <- listings_sydney_golden
#replacing NA values in beds - using base R as recode is not working (we cannot use this method to change any original data)
model_beds <-lm(log_price_4_nights ~ beds, data = listings_sydney_beds)
msummary(model_beds)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.75530 0.01323 434.91 <2e-16 ***
beds 0.27846 0.00579 48.09 <2e-16 ***
Residual standard error: 0.6122 on 6055 degrees of freedom
(54 observations deleted due to missingness)
Multiple R-squared: 0.2764, Adjusted R-squared: 0.2763
F-statistic: 2313 on 1 and 6055 DF, p-value: < 2.2e-16
beds works as well!
model_accommodates <-lm(log_price_4_nights ~ accommodates, data = listings_sydney_golden)
msummary(model_accommodates)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.523914 0.014768 374.06 <2e-16 ***
accommodates 0.224385 0.003868 58.01 <2e-16 ***
Residual standard error: 0.5779 on 6109 degrees of freedom
Multiple R-squared: 0.3552, Adjusted R-squared: 0.3551
F-statistic: 3365 on 1 and 6109 DF, p-value: < 2.2e-16
As expected, accommodates is also significant.
Now what happens when we put the above all together?
#test collinearity between bathrooms_number, bedrooms, beds and accommodates
model_4_variables <-lm(log_price_4_nights~ bathrooms_number+bedrooms+beds+accommodates, data=listings_sydney_golden)
msummary(model_4_variables)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.385095 0.019640 274.194 < 2e-16 ***
bathrooms_number 0.065404 0.017105 3.824 0.000133 ***
bedrooms 0.313939 0.018199 17.250 < 2e-16 ***
beds -0.028267 0.011244 -2.514 0.011969 *
accommodates 0.110133 0.009221 11.944 < 2e-16 ***
Residual standard error: 0.5692 on 5562 degrees of freedom
(544 observations deleted due to missingness)
Multiple R-squared: 0.404, Adjusted R-squared: 0.4035
F-statistic: 942.4 on 4 and 5562 DF, p-value: < 2.2e-16
car::vif(model_4_variables) #generally speaking, we should keep variables with vif ranging from 1 to 10, so all these four variables can be kept
bathrooms_number bedrooms beds accommodates
1.664195 4.510377 4.112733 5.564048
autoplot(model_4_variables)+theme_bw()
Together, the four variables give us a better model than any of the prior ones (adjusted R squared of 0.4035), but the vif for accommodates is greater than 5.
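To make the VIF values less of a black box: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes this by hand on simulated data, in which accommodates is deliberately built from bedrooms. The data frame sim and the helper manual_vif are hypothetical and only illustrate the definition.

```r
set.seed(42)
n <- 500

# Simulated predictors (hypothetical): accommodates depends strongly on
# bedrooms, while bathrooms is generated independently
bedrooms     <- rpois(n, 2) + 1
accommodates <- 2 * bedrooms + rpois(n, 1)
bathrooms    <- rpois(n, 1) + 1
sim <- data.frame(bedrooms, accommodates, bathrooms)

# VIF of one predictor: regress it on the remaining predictors and
# convert the R squared of that regression
manual_vif <- function(var, data) {
  others <- setdiff(names(data), var)
  r2 <- summary(lm(reformulate(others, response = var), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(sim), manual_vif, data = sim)
round(vifs, 2)
```

The response variable plays no role in this calculation, which is why a high VIF says something about the predictors overlapping rather than about how well the model fits.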
model_4_variables_final <-lm(log_price_4_nights~ bathrooms_number+bedrooms+beds+accommodates+prop_type_simplified+number_of_reviews+review_scores_rating+room_type, data=listings_sydney_golden)
msummary(model_4_variables_final)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.679e+00 4.214e-02 134.769 < 2e-16 ***
bathrooms_number 1.459e-01 1.685e-02 8.658 < 2e-16 ***
bedrooms 2.165e-01 1.883e-02 11.496 < 2e-16 ***
beds -7.348e-03 1.072e-02 -0.686 0.49297
accommodates 4.286e-02 8.967e-03 4.780 1.81e-06 ***
prop_type_simplifiedEntire residential home 1.577e-01 2.784e-02 5.665 1.56e-08 ***
prop_type_simplifiedOther -1.486e-02 2.649e-02 -0.561 0.57485
prop_type_simplifiedPrivate room in rental unit -1.159e-01 4.472e-02 -2.593 0.00955 **
prop_type_simplifiedPrivate room in residential home -2.570e-01 4.626e-02 -5.555 2.94e-08 ***
number_of_reviews -6.772e-05 1.407e-04 -0.481 0.63043
review_scores_rating 2.292e-02 7.712e-03 2.972 0.00298 **
room_typeHotel room 2.554e-01 9.306e-02 2.744 0.00609 **
room_typePrivate room -4.272e-01 4.044e-02 -10.563 < 2e-16 ***
room_typeShared room -8.936e-01 1.040e-01 -8.592 < 2e-16 ***
Residual standard error: 0.4826 on 4491 degrees of freedom
(1606 observations deleted due to missingness)
Multiple R-squared: 0.5421, Adjusted R-squared: 0.5408
F-statistic: 409.1 on 13 and 4491 DF, p-value: < 2.2e-16
car::vif(model_4_variables_final) #generally speaking, we should keep variables with vif ranging from 1 to 5. Note that bedrooms, accommodates, prop_type_simplified and room_type all have a vif greater than 5, so we remove prop_type_simplified and hope to see a reduced vif for the others in a future model.
GVIF Df GVIF^(1/(2*Df))
bathrooms_number 1.746976 1 1.321732
bedrooms 5.390611 1 2.321769
beds 4.266013 1 2.065433
accommodates 6.073956 1 2.464540
prop_type_simplified 10.304991 4 1.338539
number_of_reviews 1.054508 1 1.026893
review_scores_rating 1.025472 1 1.012656
room_type 7.733543 3 1.406252
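For terms with more than one degree of freedom, such as prop_type_simplified and room_type, car::vif reports a generalized VIF together with the adjusted value GVIF^(1/(2*Df)), which is meant to be comparable across terms with different Df. One common reading, offered here as a suggestion rather than as the rule used in this report, is to square the adjusted value before comparing it with the usual VIF thresholds. The sketch below reproduces the third column from the first two, using values copied by hand from the table above.

```r
# GVIF and Df copied from the car::vif output above
gvif <- c(prop_type_simplified = 10.304991,
          room_type            = 7.733543,
          accommodates         = 6.073956)
df   <- c(prop_type_simplified = 4,
          room_type            = 3,
          accommodates         = 1)

# Adjusted value, as reported in the third column of the output
adjusted <- gvif^(1 / (2 * df))
round(adjusted, 6)

# Squared adjusted value, on the same scale as an ordinary VIF
round(adjusted^2, 2)
```

On this scale the two categorical terms come out below 2, while accommodates keeps its value of about 6.07, so under this reading accommodates would be the main collinearity concern.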
#more reasons such as why this happens
autoplot(model_4_variables_final)+theme_bw()
model_4_variables_final2 <-lm(log_price_4_nights~ bathrooms_number+bedrooms+beds+accommodates+number_of_reviews+review_scores_rating+room_type, data=listings_sydney_golden)
msummary(model_4_variables_final2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.646e+00 4.200e-02 134.430 < 2e-16 ***
bathrooms_number 1.397e-01 1.690e-02 8.269 < 2e-16 ***
bedrooms 2.505e-01 1.810e-02 13.837 < 2e-16 ***
beds -5.033e-03 1.080e-02 -0.466 0.64129
accommodates 4.140e-02 9.039e-03 4.580 4.77e-06 ***
number_of_reviews -5.254e-05 1.409e-04 -0.373 0.70922
review_scores_rating 2.416e-02 7.768e-03 3.110 0.00188 **
room_typeHotel room 2.348e-01 9.119e-02 2.574 0.01007 *
room_typePrivate room -5.770e-01 1.782e-02 -32.379 < 2e-16 ***
room_typeShared room -9.107e-01 1.022e-01 -8.906 < 2e-16 ***
Residual standard error: 0.4869 on 4495 degrees of freedom
(1606 observations deleted due to missingness)
Multiple R-squared: 0.5336, Adjusted R-squared: 0.5327
F-statistic: 571.5 on 9 and 4495 DF, p-value: < 2.2e-16
car::vif(model_4_variables_final2)
GVIF Df GVIF^(1/(2*Df))
bathrooms_number 1.725377 1 1.313536
bedrooms 4.894539 1 2.212360
beds 4.259119 1 2.063763
accommodates 6.065239 1 2.462771
number_of_reviews 1.038256 1 1.018949
review_scores_rating 1.022591 1 1.011233
room_type 1.379121 3 1.055036
autoplot(model_4_variables_final2)+theme_bw()
#note that accommodates' vif is greater than 5
#coordinates for Sydney Opera house: latitude -33.8568°, longitude 151.2153°
#formula for distance between two coordinates: sqrt((x1-x2)^2+(y1-y2)^2)
model_opera <-lm(log_price_4_nights ~ distance_opera, data = listings_sydney_golden)
summary(model_opera)
Call:
lm(formula = log_price_4_nights ~ distance_opera, data = listings_sydney_golden)
Residuals:
Min 1Q Median 3Q Max
-2.4848 -0.4984 -0.0873 0.3854 4.6849
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.27267 0.01375 456.263 <2e-16 ***
distance_opera -0.08931 0.12737 -0.701 0.483
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7197 on 6109 degrees of freedom
Multiple R-squared: 8.048e-05, Adjusted R-squared: -8.32e-05
F-statistic: 0.4917 on 1 and 6109 DF, p-value: 0.4832
Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
#categorical variables- superhost to show whether it has a pricing premium
model_superhost<-lm(log_price_4_nights~ bathrooms_number+bedrooms+beds+accommodates+host_is_superhost+number_of_reviews+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_superhost)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6497921 0.0420983 134.205 < 2e-16 ***
bathrooms_number 0.1383565 0.0169231 8.176 3.79e-16 ***
bedrooms 0.2510826 0.0181064 13.867 < 2e-16 ***
beds -0.0052422 0.0108025 -0.485 0.62750
accommodates 0.0413960 0.0090381 4.580 4.77e-06 ***
host_is_superhostTRUE 0.0288629 0.0206824 1.396 0.16293
number_of_reviews -0.0001195 0.0001488 -0.803 0.42194
review_scores_rating 0.0228035 0.0078283 2.913 0.00360 **
room_typeHotel room 0.2382824 0.0912135 2.612 0.00902 **
room_typePrivate room -0.5767763 0.0178185 -32.370 < 2e-16 ***
room_typeShared room -0.9070708 0.1022715 -8.869 < 2e-16 ***
Residual standard error: 0.4868 on 4494 degrees of freedom
(1606 observations deleted due to missingness)
Multiple R-squared: 0.5338, Adjusted R-squared: 0.5328
F-statistic: 514.6 on 10 and 4494 DF, p-value: < 2.2e-16
car::vif(model_superhost)
GVIF Df GVIF^(1/(2*Df))
bathrooms_number 1.731141 1 1.315728
bedrooms 4.897131 1 2.212946
beds 4.259938 1 2.063962
accommodates 6.065240 1 2.462771
host_is_superhost 1.151071 1 1.072880
number_of_reviews 1.158792 1 1.076472
review_scores_rating 1.038628 1 1.019131
room_type 1.381051 3 1.055281
# vif for accommodates is 6.06
(instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?
model_instant_bookable<- lm(log_price_4_nights~bathrooms_number+bedrooms+beds+accommodates+host_is_superhost+instant_bookable+number_of_reviews+review_scores_rating+room_type, data=listings_sydney_golden)
msummary(model_instant_bookable)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6847640 0.0428951 132.527 < 2e-16 ***
bathrooms_number 0.1384559 0.0168939 8.196 3.22e-16 ***
bedrooms 0.2451502 0.0181338 13.519 < 2e-16 ***
beds -0.0064136 0.0107877 -0.595 0.55219
accommodates 0.0441355 0.0090475 4.878 1.11e-06 ***
host_is_superhostTRUE 0.0302634 0.0206496 1.466 0.14283
instant_bookableTRUE -0.0615180 0.0151154 -4.070 4.78e-05 ***
number_of_reviews -0.0001185 0.0001486 -0.797 0.42525
review_scores_rating 0.0205140 0.0078350 2.618 0.00887 **
room_typeHotel room 0.2614332 0.0912335 2.866 0.00418 **
room_typePrivate room -0.5764499 0.0177879 -32.407 < 2e-16 ***
room_typeShared room -0.9179767 0.1021300 -8.988 < 2e-16 ***
Residual standard error: 0.486 on 4493 degrees of freedom
(1606 observations deleted due to missingness)
Multiple R-squared: 0.5355, Adjusted R-squared: 0.5344
F-statistic: 471 on 11 and 4493 DF, p-value: < 2.2e-16
car::vif(model_instant_bookable)
GVIF Df GVIF^(1/(2*Df))
bathrooms_number 1.731145 1 1.315730
bedrooms 4.928978 1 2.220130
beds 4.262972 1 2.064697
accommodates 6.098998 1 2.469615
host_is_superhost 1.151390 1 1.073029
instant_bookable 1.020612 1 1.010254
number_of_reviews 1.158796 1 1.076474
review_scores_rating 1.044010 1 1.021768
room_type 1.387471 3 1.056097
autoplot(model_instant_bookable)+theme_bw()
neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings fall in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variable neighbourhood_simplified and determine whether location is a predictor of price_4_nights.
listings_sydney_golden %>%
group_by(neighbourhood_cleansed)%>%
summarise(count=n()) %>%
arrange(desc(count))
| neighbourhood_cleansed | count |
|---|---|
| Sydney | 1872 |
| Waverley | 739 |
| Randwick | 476 |
| Marrickville | 293 |
| North Sydney | 265 |
| Woollahra | 247 |
| Warringah | 210 |
| Leichhardt | 200 |
| Manly | 179 |
| Pittwater | 175 |
| Rockdale | 130 |
| Ryde | 109 |
| Botany Bay | 107 |
| Auburn | 103 |
| Sutherland Shire | 77 |
| Willoughby | 71 |
| Canada Bay | 68 |
| Mosman | 68 |
| Parramatta | 66 |
| Hornsby | 65 |
| Canterbury | 62 |
| Ku-Ring-Gai | 56 |
| Burwood | 51 |
| Lane Cove | 51 |
| Ashfield | 50 |
| Blacktown | 49 |
| Bankstown | 39 |
| The Hills Shire | 39 |
| Hurstville | 38 |
| City Of Kogarah | 29 |
| Strathfield | 29 |
| Penrith | 21 |
| Fairfield | 20 |
| Campbelltown | 15 |
| Hunters Hill | 15 |
| Liverpool | 12 |
| Holroyd | 10 |
| Camden | 5 |
# since we have already chosen Greater Sydney as our target city, we will divide neighbourhoods based on their geographic locations into 5 parts: central sydney, east sydney, north sydney, west sydney and south sydney
neighbourhood_location<-c("central sydney","east sydney","north sydney","west sydney","south sydney")
listings_sydney_golden<-listings_sydney_golden %>%
mutate(neighbourhood_simplified=case_when(
neighbourhood_cleansed %in% c("Sydney") ~ "Central",
neighbourhood_cleansed %in% c("Botany Bay","Camden","Waverley","Randwick","Woollahra") ~ "East",
neighbourhood_cleansed %in% c("North Sydney","Warringah","Manly","Pittwater","Mosman","Hornsby","Ku-Ring-Gai","Lane Cove","Hunters Hill","Willoughby") ~ "North",
neighbourhood_cleansed %in% c("Rockdale","Sutherland Shire","Hurstville","City of Kogarah") ~ "South",
neighbourhood_cleansed %in% c("Marrickville","Leichhardt","Ryde","Auburn","Canada Bay","Parramatta","Canterbury","Burwood","Ashfield","Blacktown","Bankstown","The Hills Shire","Strathfield","Penrith","Fairfield","Campbelltown","Liverpool")~ "West",
TRUE ~ "Other"))
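With a hand-written grouping like this, any neighbourhood whose name does not match exactly falls through to "Other". In the count table above the names appear as "City Of Kogarah" and "Holroyd", whereas the grouping spells the first one "City of Kogarah" and does not list the second, so both would end up in "Other". A minimal sketch of a case-insensitive lookup with a check for unmatched names follows. The names region_lookup and simplify_neighbourhood are hypothetical, and the lookup table is shortened to one neighbourhood per region for illustration.

```r
# Hypothetical lookup table: lower-case neighbourhood name -> region
# (shortened version of the grouping above)
region_lookup <- c(
  "sydney"          = "Central",
  "waverley"        = "East",
  "manly"           = "North",
  "city of kogarah" = "South",
  "parramatta"      = "West"
)

# Map names to regions, ignoring case; unknown names become "Other"
simplify_neighbourhood <- function(x) {
  region <- unname(region_lookup[tolower(x)])
  ifelse(is.na(region), "Other", region)
}

observed   <- c("Sydney", "City Of Kogarah", "Holroyd", "Manly")
simplified <- simplify_neighbourhood(observed)
simplified

# Names that fall through to "Other" are worth checking by eye
unmatched <- unique(observed[simplified == "Other"])
unmatched
```

Listing the unmatched names after the grouping step makes it easy to tell a deliberate "Other" category apart from a spelling mismatch.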
model_neighbourhood<-lm(log_price_4_nights~neighbourhood_simplified,data=listings_sydney_golden)
msummary(model_neighbourhood)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.27224 0.01610 389.495 < 2e-16 ***
neighbourhood_simplifiedEast 0.02348 0.02383 0.985 0.32455
neighbourhood_simplifiedNorth 0.27925 0.02607 10.712 < 2e-16 ***
neighbourhood_simplifiedOther -0.24591 0.11272 -2.182 0.02918 *
neighbourhood_simplifiedSouth -0.15416 0.04734 -3.257 0.00113 **
neighbourhood_simplifiedWest -0.28815 0.02560 -11.256 < 2e-16 ***
Residual standard error: 0.6967 on 6105 degrees of freedom
Multiple R-squared: 0.0634, Adjusted R-squared: 0.06264
F-statistic: 82.66 on 5 and 6105 DF, p-value: < 2.2e-16
# locations: significant, except for East Sydney, which is not significantly different from the baseline (Central). Why? Maybe because of the economic status of that part? find more PEST factors (needs more interpretation)
# then testing whether 'neighbourhood_simplified' is a significant predictor for price_4_nights by controlling other variables
model_neighbourhood_cleansed<-lm(log_price_4_nights~neighbourhood_simplified+bathrooms_number+bedrooms+beds+accommodates+host_is_superhost+instant_bookable+number_of_reviews+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_neighbourhood_cleansed)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.7375381 0.0428014 134.050 < 2e-16 ***
neighbourhood_simplifiedEast -0.0005874 0.0191315 -0.031 0.97551
neighbourhood_simplifiedNorth 0.0571296 0.0207036 2.759 0.00581 **
neighbourhood_simplifiedOther -0.3770220 0.0935316 -4.031 5.65e-05 ***
neighbourhood_simplifiedSouth -0.2533750 0.0375317 -6.751 1.66e-11 ***
neighbourhood_simplifiedWest -0.2835638 0.0203669 -13.923 < 2e-16 ***
bathrooms_number 0.1412038 0.0163555 8.633 < 2e-16 ***
bedrooms 0.2438566 0.0177044 13.774 < 2e-16 ***
beds -0.0118459 0.0104310 -1.136 0.25616
accommodates 0.0503154 0.0087909 5.724 1.11e-08 ***
host_is_superhostTRUE 0.0490761 0.0201049 2.441 0.01469 *
instant_bookableTRUE -0.0603499 0.0146379 -4.123 3.81e-05 ***
number_of_reviews -0.0001319 0.0001442 -0.915 0.36022
review_scores_rating 0.0154186 0.0075722 2.036 0.04179 *
room_typeHotel room 0.2191907 0.0881573 2.486 0.01294 *
room_typePrivate room -0.5381683 0.0173369 -31.042 < 2e-16 ***
room_typeShared room -0.8397026 0.0986974 -8.508 < 2e-16 ***
Residual standard error: 0.4691 on 4488 degrees of freedom
(1606 observations deleted due to missingness)
Multiple R-squared: 0.5677, Adjusted R-squared: 0.5661
F-statistic: 368.3 on 16 and 4488 DF, p-value: < 2.2e-16
car::vif(model_neighbourhood_cleansed)
GVIF Df GVIF^(1/(2*Df))
neighbourhood_simplified 1.128783 5 1.012188
bathrooms_number 1.741162 1 1.319531
bedrooms 5.041776 1 2.245390
beds 4.277070 1 2.068108
accommodates 6.178858 1 2.485731
host_is_superhost 1.171245 1 1.082241
instant_bookable 1.027119 1 1.013469
number_of_reviews 1.170844 1 1.082055
review_scores_rating 1.046425 1 1.022949
room_type 1.420312 3 1.060223
# with F statistic's p-value smaller than 0.001, the model itself is significant, and adjusted R-square is getting greater
autoplot(model_neighbourhood_cleansed)+theme_bw()
model_neighbourhood_cleansed2<-lm(log_price_4_nights~neighbourhood_simplified+bathrooms_number+bedrooms+accommodates+host_is_superhost+instant_bookable+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_neighbourhood_cleansed2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.7381019 0.0422993 135.655 < 2e-16 ***
neighbourhood_simplifiedEast 0.0003965 0.0190050 0.021 0.98336
neighbourhood_simplifiedNorth 0.0567617 0.0205813 2.758 0.00584 **
neighbourhood_simplifiedOther -0.3720965 0.0933802 -3.985 6.86e-05 ***
neighbourhood_simplifiedSouth -0.2568450 0.0373825 -6.871 7.26e-12 ***
neighbourhood_simplifiedWest -0.2826136 0.0202875 -13.930 < 2e-16 ***
bathrooms_number 0.1412536 0.0163204 8.655 < 2e-16 ***
bedrooms 0.2406349 0.0171944 13.995 < 2e-16 ***
accommodates 0.0444407 0.0073031 6.085 1.26e-09 ***
host_is_superhostTRUE 0.0421184 0.0189995 2.217 0.02669 *
instant_bookableTRUE -0.0596457 0.0145907 -4.088 4.43e-05 ***
review_scores_rating 0.0152765 0.0075186 2.032 0.04223 *
room_typeHotel room 0.2132266 0.0879698 2.424 0.01540 *
room_typePrivate room -0.5388664 0.0172415 -31.254 < 2e-16 ***
room_typeShared room -0.8588854 0.0968081 -8.872 < 2e-16 ***
Residual standard error: 0.4688 on 4505 degrees of freedom
(1591 observations deleted due to missingness)
Multiple R-squared: 0.5679, Adjusted R-squared: 0.5665
F-statistic: 422.8 on 14 and 4505 DF, p-value: < 2.2e-16
car::vif(model_neighbourhood_cleansed2)
GVIF Df GVIF^(1/(2*Df))
neighbourhood_simplified 1.112782 5 1.010744
bathrooms_number 1.737565 1 1.318167
bedrooms 4.768603 1 2.183713
accommodates 4.276853 1 2.068055
host_is_superhost 1.050257 1 1.024821
instant_bookable 1.026558 1 1.013192
review_scores_rating 1.043291 1 1.021416
room_type 1.359120 3 1.052470
#anova part is to figure out whether neighbourhood_simplified has an impact on the model; since the F statistic is large enough and the p-value is smaller than 0.001, this variable should be kept
anova(model_instant_bookable,model_neighbourhood_cleansed)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 4.49e+03 | 1.06e+03 | | | | |
| 4.49e+03 | 988 | 5 | 73.4 | 66.7 | 2.1e-67 |
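The F statistic in this table can be reproduced by hand: it is the drop in the residual sum of squares per added parameter, divided by the residual variance of the larger model. The sketch below uses the rounded values shown in the table and the 4488 residual degrees of freedom reported for model_neighbourhood_cleansed, so the result only matches up to rounding.

```r
# Values taken from the anova table and the model summary above
sum_of_sq    <- 73.4   # reduction in RSS from adding neighbourhood_simplified
added_df     <- 5      # number of extra coefficients in the larger model
rss_large    <- 988    # RSS of the larger model (rounded in the table)
res_df_large <- 4488   # residual degrees of freedom of the larger model

# Partial F statistic for the added term
f_stat <- (sum_of_sq / added_df) / (rss_large / res_df_large)
round(f_stat, 1)

# Upper-tail p-value of the F distribution with (5, 4488) degrees of freedom
p_value <- pf(f_stat, added_df, res_df_large, lower.tail = FALSE)
p_value
```

The statistic comes out close to the 66.7 shown in the table, and the p-value is vanishingly small, which supports keeping neighbourhood_simplified in the model.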
#not sure what it is for, in order to control other variables, it should first run a linear regression model without this variable, but containing other variables(like x1,x2 ...)and then run a new linear regress model containing all variables
model_immediate_booking<-lm(log_price_4_nights~has_availability,data=listings_sydney_golden)
msummary(model_immediate_booking)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.23173 0.09291 67.071 <2e-16 ***
has_availabilityTRUE 0.03412 0.09337 0.365 0.715
Residual standard error: 0.7197 on 6109 degrees of freedom
Multiple R-squared: 2.186e-05, Adjusted R-squared: -0.0001418
F-statistic: 0.1335 on 1 and 6109 DF, p-value: 0.7148
# the factor is not significant at any conventional level (p-value = 0.715), so we should drop this variable in new models
#after adding other variables, it turns out that there is no pricing premium with the variable of model_immediate_booking, since the adjusted R-squared is still 0.447 and the p-value for has_availability is greater than 0.01, so we decided to drop this variable since it cannot be a useful variable for predictions
model_immediate_booking_final<-lm(log_price_4_nights~neighbourhood_simplified+bathrooms_number+bedrooms+beds+accommodates+host_is_superhost+instant_bookable+number_of_reviews+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_immediate_booking_final)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.7375381 0.0428014 134.050 < 2e-16 ***
neighbourhood_simplifiedEast -0.0005874 0.0191315 -0.031 0.97551
neighbourhood_simplifiedNorth 0.0571296 0.0207036 2.759 0.00581 **
neighbourhood_simplifiedOther -0.3770220 0.0935316 -4.031 5.65e-05 ***
neighbourhood_simplifiedSouth -0.2533750 0.0375317 -6.751 1.66e-11 ***
neighbourhood_simplifiedWest -0.2835638 0.0203669 -13.923 < 2e-16 ***
bathrooms_number 0.1412038 0.0163555 8.633 < 2e-16 ***
bedrooms 0.2438566 0.0177044 13.774 < 2e-16 ***
beds -0.0118459 0.0104310 -1.136 0.25616
accommodates 0.0503154 0.0087909 5.724 1.11e-08 ***
host_is_superhostTRUE 0.0490761 0.0201049 2.441 0.01469 *
instant_bookableTRUE -0.0603499 0.0146379 -4.123 3.81e-05 ***
number_of_reviews -0.0001319 0.0001442 -0.915 0.36022
review_scores_rating 0.0154186 0.0075722 2.036 0.04179 *
room_typeHotel room 0.2191907 0.0881573 2.486 0.01294 *
room_typePrivate room -0.5381683 0.0173369 -31.042 < 2e-16 ***
room_typeShared room -0.8397026 0.0986974 -8.508 < 2e-16 ***
Residual standard error: 0.4691 on 4488 degrees of freedom
(1606 observations deleted due to missingness)
Multiple R-squared: 0.5677, Adjusted R-squared: 0.5661
F-statistic: 368.3 on 16 and 4488 DF, p-value: < 2.2e-16
car::vif(model_immediate_booking_final)
GVIF Df GVIF^(1/(2*Df))
neighbourhood_simplified 1.128783 5 1.012188
bathrooms_number 1.741162 1 1.319531
bedrooms 5.041776 1 2.245390
beds 4.277070 1 2.068108
accommodates 6.178858 1 2.485731
host_is_superhost 1.171245 1 1.082241
instant_bookable 1.027119 1 1.013469
number_of_reviews 1.170844 1 1.082055
review_scores_rating 1.046425 1 1.022949
room_type 1.420312 3 1.060223
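For a numeric predictor, the variance inflation factor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated data; the helper `vif_manual()` and the simulated columns are hypothetical and only borrow names from the listings data for readability. It handles one-coefficient numeric terms only; for factors such as `room_type`, `car::vif()` reports the generalised version (GVIF) shown above.

```r
# Simulated predictors: accommodates is built to be correlated with bedrooms
set.seed(2)
n <- 300
bedrooms     <- rpois(n, 2) + 1
accommodates <- 2 * bedrooms + rpois(n, 1)
rating       <- rnorm(n, 4.5, 0.3)          # unrelated to the other two
y <- 5 + 0.2 * bedrooms + 0.05 * accommodates + 0.02 * rating + rnorm(n, sd = 0.4)

fit <- lm(y ~ bedrooms + accommodates + rating)

# VIF for predictor j = 1 / (1 - R^2) from regressing j on the other predictors
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(fit)
vifs
```

In this toy example the two size-related predictors inflate each other's variance while `rating` stays close to 1, the same pattern as `bedrooms`, `beds`, and `accommodates` in the output above.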
autoplot(model_immediate_booking_final)+theme_bw()
availability_30 or reviews_per_month on price_4_nights, after we control for other variables?
model_availability<-lm(log_price_4_nights~availability_30+neighbourhood_simplified+bathrooms_number+bedrooms+beds+accommodates+host_is_superhost+instant_bookable+number_of_reviews+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_availability)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.7110007 0.0426509 133.901 < 2e-16 ***
availability_30 0.0045961 0.0005866 7.836 5.79e-15 ***
neighbourhood_simplifiedEast -0.0012104 0.0190042 -0.064 0.9492
neighbourhood_simplifiedNorth 0.0499703 0.0205860 2.427 0.0152 *
neighbourhood_simplifiedOther -0.3849392 0.0929141 -4.143 3.49e-05 ***
neighbourhood_simplifiedSouth -0.2589444 0.0372884 -6.944 4.35e-12 ***
neighbourhood_simplifiedWest -0.2925903 0.0202640 -14.439 < 2e-16 ***
bathrooms_number 0.1399203 0.0162473 8.612 < 2e-16 ***
bedrooms 0.2464159 0.0175895 14.009 < 2e-16 ***
beds -0.0136642 0.0103641 -1.318 0.1874
accommodates 0.0487059 0.0087348 5.576 2.60e-08 ***
host_is_superhostTRUE 0.0330220 0.0200758 1.645 0.1001
instant_bookableTRUE -0.0576433 0.0145445 -3.963 7.51e-05 ***
number_of_reviews -0.0002890 0.0001446 -1.999 0.0457 *
review_scores_rating 0.0173582 0.0075258 2.306 0.0211 *
room_typeHotel room 0.1735356 0.0877636 1.977 0.0481 *
room_typePrivate room -0.5473269 0.0172610 -31.709 < 2e-16 ***
room_typeShared room -0.8501996 0.0980491 -8.671 < 2e-16 ***
Residual standard error: 0.466 on 4487 degrees of freedom
(1606 observations deleted due to missingness)
Multiple R-squared: 0.5735, Adjusted R-squared: 0.5719
F-statistic: 354.9 on 17 and 4487 DF, p-value: < 2.2e-16
car::vif(model_availability)
GVIF Df GVIF^(1/(2*Df))
availability_30 1.062963 1 1.031001
neighbourhood_simplified 1.134297 5 1.012681
bathrooms_number 1.741339 1 1.319598
bedrooms 5.043515 1 2.245777
beds 4.279216 1 2.068626
accommodates 6.182276 1 2.486418
host_is_superhost 1.183573 1 1.087921
instant_bookable 1.027699 1 1.013755
number_of_reviews 1.193796 1 1.092610
review_scores_rating 1.047558 1 1.023503
room_type 1.432329 3 1.061713
autoplot(model_availability)+theme_bw()
#this time host_is_superhost is no longer significant at the 5% level (p = 0.100), beds remains insignificant (p = 0.187), and number_of_reviews is only marginally significant (p = 0.046). The adjusted R-squared rises only slightly (0.572 versus 0.566 in the previous model), so we consider dropping these three variables
model_availability<-lm(log_price_4_nights~availability_30+neighbourhood_simplified+bathrooms_number+bedrooms+accommodates+instant_bookable+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_availability)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.7057368 0.0421063 135.508 < 2e-16 ***
availability_30 0.0044816 0.0005718 7.838 5.69e-15 ***
neighbourhood_simplifiedEast 0.0008903 0.0188756 0.047 0.9624
neighbourhood_simplifiedNorth 0.0520134 0.0204545 2.543 0.0110 *
neighbourhood_simplifiedOther -0.3779571 0.0928021 -4.073 4.73e-05 ***
neighbourhood_simplifiedSouth -0.2566638 0.0369541 -6.945 4.31e-12 ***
neighbourhood_simplifiedWest -0.2892724 0.0201795 -14.335 < 2e-16 ***
bathrooms_number 0.1415615 0.0162029 8.737 < 2e-16 ***
bedrooms 0.2430378 0.0170733 14.235 < 2e-16 ***
accommodates 0.0417181 0.0072657 5.742 9.99e-09 ***
instant_bookableTRUE -0.0569918 0.0145020 -3.930 8.62e-05 ***
review_scores_rating 0.0180658 0.0073835 2.447 0.0145 *
room_typeHotel room 0.1632153 0.0876349 1.862 0.0626 .
room_typePrivate room -0.5475177 0.0171508 -31.924 < 2e-16 ***
room_typeShared room -0.8732446 0.0961574 -9.081 < 2e-16 ***
Residual standard error: 0.4659 on 4505 degrees of freedom
(1591 observations deleted due to missingness)
Multiple R-squared: 0.5732, Adjusted R-squared: 0.5719
F-statistic: 432.2 on 14 and 4505 DF, p-value: < 2.2e-16
car::vif(model_availability)
GVIF Df GVIF^(1/(2*Df))
availability_30 1.016872 1 1.008401
neighbourhood_simplified 1.099250 5 1.009508
bathrooms_number 1.734104 1 1.316854
bedrooms 4.760599 1 2.181880
accommodates 4.286254 1 2.070327
instant_bookable 1.026831 1 1.013327
review_scores_rating 1.018722 1 1.009318
room_type 1.366330 3 1.053398
autoplot(model_availability)+theme_bw()
# in the next model (model_reviews), host_is_superhost is significant again (p = 0.004); reviews_per_month is significant at the 1% level (p = 0.006) but not at the 0.1% level.
model_reviews<-lm(log_price_4_nights~reviews_per_month+neighbourhood_simplified+bathrooms_number+bedrooms+accommodates+host_is_superhost+instant_bookable+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_reviews)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.745472 0.042354 135.655 < 2e-16 ***
reviews_per_month -0.013421 0.004883 -2.748 0.00601 **
neighbourhood_simplifiedEast -0.003304 0.019039 -0.174 0.86225
neighbourhood_simplifiedNorth 0.054003 0.020591 2.623 0.00875 **
neighbourhood_simplifiedOther -0.375201 0.093319 -4.021 5.90e-05 ***
neighbourhood_simplifiedSouth -0.257738 0.037357 -6.899 5.95e-12 ***
neighbourhood_simplifiedWest -0.284144 0.020280 -14.011 < 2e-16 ***
bathrooms_number 0.139544 0.016320 8.550 < 2e-16 ***
bedrooms 0.238226 0.017204 13.847 < 2e-16 ***
accommodates 0.045393 0.007306 6.213 5.67e-10 ***
host_is_superhostTRUE 0.056827 0.019725 2.881 0.00398 **
instant_bookableTRUE -0.057068 0.014610 -3.906 9.52e-05 ***
review_scores_rating 0.016498 0.007526 2.192 0.02843 *
room_typeHotel room 0.227009 0.088049 2.578 0.00996 **
room_typePrivate room -0.543864 0.017325 -31.392 < 2e-16 ***
room_typeShared room -0.865254 0.096766 -8.942 < 2e-16 ***
Residual standard error: 0.4685 on 4504 degrees of freedom
(1591 observations deleted due to missingness)
Multiple R-squared: 0.5686, Adjusted R-squared: 0.5671
F-statistic: 395.7 on 15 and 4504 DF, p-value: < 2.2e-16
car::vif(model_reviews)
GVIF Df GVIF^(1/(2*Df))
reviews_per_month 1.138613 1 1.067058
neighbourhood_simplified 1.119083 5 1.011314
bathrooms_number 1.740093 1 1.319126
bedrooms 4.781013 1 2.186553
accommodates 4.286485 1 2.070383
host_is_superhost 1.133700 1 1.064753
instant_bookable 1.030805 1 1.015286
review_scores_rating 1.046942 1 1.023202
room_type 1.380336 3 1.055190
# next we fit a model with all the variables used before plus reviews_per_month and availability_30; both are significant, and host_is_superhost is significant again (p = 0.039)
model_reviews_availability<-lm(log_price_4_nights~availability_30+reviews_per_month+neighbourhood_simplified+bathrooms_number+bedrooms+accommodates+host_is_superhost+instant_bookable+review_scores_rating+room_type,data=listings_sydney_golden)
msummary(model_reviews_availability)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.717299 0.042178 135.551 < 2e-16 ***
availability_30 0.004861 0.000588 8.267 < 2e-16 ***
reviews_per_month -0.021086 0.004935 -4.273 1.97e-05 ***
neighbourhood_simplifiedEast -0.004090 0.018898 -0.216 0.828680
neighbourhood_simplifiedNorth 0.046369 0.020459 2.266 0.023473 *
neighbourhood_simplifiedOther -0.382915 0.092634 -4.134 3.64e-05 ***
neighbourhood_simplifiedSouth -0.262488 0.037085 -7.078 1.69e-12 ***
neighbourhood_simplifiedWest -0.292827 0.020158 -14.527 < 2e-16 ***
bathrooms_number 0.138071 0.016201 8.523 < 2e-16 ***
bedrooms 0.240459 0.017079 14.079 < 2e-16 ***
accommodates 0.042718 0.007259 5.885 4.28e-09 ***
host_is_superhostTRUE 0.040632 0.019677 2.065 0.038988 *
instant_bookableTRUE -0.052906 0.014511 -3.646 0.000269 ***
review_scores_rating 0.018940 0.007476 2.533 0.011335 *
room_typeHotel room 0.181586 0.087570 2.074 0.038174 *
room_typePrivate room -0.555462 0.017254 -32.194 < 2e-16 ***
room_typeShared room -0.880860 0.096069 -9.169 < 2e-16 ***
Residual standard error: 0.465 on 4503 degrees of freedom
(1591 observations deleted due to missingness)
Multiple R-squared: 0.575, Adjusted R-squared: 0.5735
F-statistic: 380.8 on 16 and 4503 DF, p-value: < 2.2e-16
car::vif(model_reviews_availability)
GVIF Df GVIF^(1/(2*Df))
availability_30 1.079408 1 1.038945
reviews_per_month 1.180264 1 1.086399
neighbourhood_simplified 1.123912 5 1.011750
bathrooms_number 1.740303 1 1.319206
bedrooms 4.782210 1 2.186826
accommodates 4.295017 1 2.072442
host_is_superhost 1.145048 1 1.070069
instant_bookable 1.032047 1 1.015897
review_scores_rating 1.048578 1 1.024001
room_type 1.394268 3 1.056958
autoplot(model_reviews_availability)+theme_bw()
#The ANOVA below tests whether adding reviews_per_month and host_is_superhost to model_availability improves the fit. The F statistic is 9.67 and the p-value is below 0.001, so these variables are kept
anova(model_availability,model_reviews_availability)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 4.50e+03 | 978 | | | | |
| 4.5e+03 | 974 | 2 | 4.18 | 9.67 | 6.43e-05 |
listings_sydney_golden1 <- listings_sydney_golden %>%
select(log_price_4_nights,bathrooms_number, distance_opera,host_since) %>%
ggpairs()
# first, we use the variables mentioned above to create a pairs plot (scatterplot matrix)
listings_sydney_golden2 <- listings_sydney_golden %>%
select(log_price_4_nights,#numerical variables only selected, categorical not selected for simplicity
host_listings_count,
accommodates,
bathrooms_number,
bedrooms,
beds,
availability_30, #only keeping availability over 30 days since the other availability windows add little value (checked using correlations)
number_of_reviews,
review_scores_rating,
reviews_per_month,
distance_opera) %>%
ggpairs(size=1)
listings_sydney_golden2
#in this part, we need to remove highly collinear variables (Pearson correlation >= 0.7): beds and bedrooms are highly correlated, as are number_of_reviews and reviews_per_month
huxtable that shows which models we worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.
# produce summary table comparing models using huxtable::huxreg()
huxreg(model1, model2, model_4_variables,model_4_variables_final,model_superhost, model_instant_bookable,model_neighbourhood_cleansed,model_reviews,model_availability,model_reviews_availability,
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma'),
# bold_signif = 0.05,
stars = NULL
) %>%
set_caption('Comparison of models')
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | (10) |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 6.277 | 6.302 | 5.385 | 5.679 | 5.650 | 5.685 | 5.738 | 5.745 | 5.706 | 5.717 |
| (0.040) | (0.039) | (0.020) | (0.042) | (0.042) | (0.043) | (0.043) | (0.042) | (0.042) | (0.042) | |
| prop_type_simplifiedEntire residential home | 0.690 | 0.689 | 0.158 | |||||||
| (0.026) | (0.026) | (0.028) | ||||||||
| prop_type_simplifiedOther | -0.165 | 0.064 | -0.015 | |||||||
| (0.023) | (0.027) | (0.026) | ||||||||
| prop_type_simplifiedPrivate room in rental unit | -0.689 | -0.076 | -0.116 | |||||||
| (0.023) | (0.048) | (0.045) | ||||||||
| prop_type_simplifiedPrivate room in residential home | -0.761 | -0.146 | -0.257 | |||||||
| (0.027) | (0.050) | (0.046) | ||||||||
| number_of_reviews | -0.000 | -0.001 | -0.000 | -0.000 | -0.000 | -0.000 | ||||
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |||||
| review_scores_rating | 0.026 | 0.022 | 0.023 | 0.023 | 0.021 | 0.015 | 0.016 | 0.018 | 0.019 | |
| (0.009) | (0.008) | (0.008) | (0.008) | (0.008) | (0.008) | (0.008) | (0.007) | (0.007) | ||
| room_typeHotel room | 0.028 | 0.255 | 0.238 | 0.261 | 0.219 | 0.227 | 0.163 | 0.182 | ||
| (0.084) | (0.093) | (0.091) | (0.091) | (0.088) | (0.088) | (0.088) | (0.088) | |||
| room_typePrivate room | -0.616 | -0.427 | -0.577 | -0.576 | -0.538 | -0.544 | -0.548 | -0.555 | ||
| (0.042) | (0.040) | (0.018) | (0.018) | (0.017) | (0.017) | (0.017) | (0.017) | |||
| room_typeShared room | -1.071 | -0.894 | -0.907 | -0.918 | -0.840 | -0.865 | -0.873 | -0.881 | ||
| (0.113) | (0.104) | (0.102) | (0.102) | (0.099) | (0.097) | (0.096) | (0.096) | |||
| bathrooms_number | 0.065 | 0.146 | 0.138 | 0.138 | 0.141 | 0.140 | 0.142 | 0.138 | ||
| (0.017) | (0.017) | (0.017) | (0.017) | (0.016) | (0.016) | (0.016) | (0.016) | |||
| bedrooms | 0.314 | 0.217 | 0.251 | 0.245 | 0.244 | 0.238 | 0.243 | 0.240 | ||
| (0.018) | (0.019) | (0.018) | (0.018) | (0.018) | (0.017) | (0.017) | (0.017) | |||
| beds | -0.028 | -0.007 | -0.005 | -0.006 | -0.012 | |||||
| (0.011) | (0.011) | (0.011) | (0.011) | (0.010) | ||||||
| accommodates | 0.110 | 0.043 | 0.041 | 0.044 | 0.050 | 0.045 | 0.042 | 0.043 | ||
| (0.009) | (0.009) | (0.009) | (0.009) | (0.009) | (0.007) | (0.007) | (0.007) | |||
| host_is_superhostTRUE | 0.029 | 0.030 | 0.049 | 0.057 | 0.041 | |||||
| (0.021) | (0.021) | (0.020) | (0.020) | (0.020) | ||||||
| instant_bookableTRUE | -0.062 | -0.060 | -0.057 | -0.057 | -0.053 | |||||
| (0.015) | (0.015) | (0.015) | (0.015) | (0.015) | ||||||
| neighbourhood_simplifiedEast | -0.001 | -0.003 | 0.001 | -0.004 | ||||||
| (0.019) | (0.019) | (0.019) | (0.019) | |||||||
| neighbourhood_simplifiedNorth | 0.057 | 0.054 | 0.052 | 0.046 | ||||||
| (0.021) | (0.021) | (0.020) | (0.020) | |||||||
| neighbourhood_simplifiedOther | -0.377 | -0.375 | -0.378 | -0.383 | ||||||
| (0.094) | (0.093) | (0.093) | (0.093) | |||||||
| neighbourhood_simplifiedSouth | -0.253 | -0.258 | -0.257 | -0.262 | ||||||
| (0.038) | (0.037) | (0.037) | (0.037) | |||||||
| neighbourhood_simplifiedWest | -0.284 | -0.284 | -0.289 | -0.293 | ||||||
| (0.020) | (0.020) | (0.020) | (0.020) | |||||||
| reviews_per_month | -0.013 | -0.021 | ||||||||
| (0.005) | (0.005) | |||||||||
| availability_30 | 0.004 | 0.005 | ||||||||
| (0.001) | (0.001) | |||||||||
| #observations | 4946 | 4946 | 5567 | 4505 | 4505 | 4505 | 4505 | 4520 | 4520 | 4520 |
| R squared | 0.361 | 0.396 | 0.404 | 0.542 | 0.534 | 0.536 | 0.568 | 0.569 | 0.573 | 0.575 |
| Adj. R Squared | 0.360 | 0.394 | 0.404 | 0.541 | 0.533 | 0.534 | 0.566 | 0.567 | 0.572 | 0.574 |
| Residual SE | 0.553 | 0.538 | 0.569 | 0.483 | 0.487 | 0.486 | 0.469 | 0.468 | 0.466 | 0.465 |
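The fit statistics at the bottom of a comparison table like this one can also be collected directly from the fitted models in base R, which is handy for sorting or plotting them. The sketch below uses two toy models on simulated data; `models`, `fit_stats`, and the column names are illustrative and not from the original analysis.

```r
# Simulated data and two nested toy models (illustrative only)
set.seed(3)
d <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
d$y <- 1 + 0.8 * d$x1 + 0.3 * d$x2 + rnorm(150)

models <- list(
  m1 = lm(y ~ x1, data = d),
  m2 = lm(y ~ x1 + x2, data = d)
)

# One row per model: number of observations, R squared, adjusted R squared, residual SE
fit_stats <- do.call(rbind, lapply(names(models), function(nm) {
  s <- summary(models[[nm]])
  data.frame(model = nm,
             nobs = nobs(models[[nm]]),
             r_squared = s$r.squared,
             adj_r_squared = s$adj.r.squared,
             residual_se = s$sigma)
}))
fit_stats
```

One caveat when reading such a table: the models above were fitted to different numbers of observations (4946, 5567, 4505, 4520), because rows with missing predictors are dropped, so differences in adjusted R-squared between them are only approximately comparable.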
#we then try adding more variables (e.g. distance_opera, license and further categorical variables) to the previous ones, aiming for a final model with the highest adjusted R-squared in which all variables are significant
listings_sydney_golden_final<-listings_sydney_golden %>%
  mutate(host_response_rate=parse_number(host_response_rate),
         host_acceptance_rate=parse_number(host_acceptance_rate),
         #count the comma-separated entries in each listing's amenities string
         #(the earlier length(list(amenities)) returned 1 for every row)
         amenities_number=lengths(strsplit(amenities,",")),
         amenities_words=amenities_number,
         #same count for the host_verifications string; vectorised, so no loop over row numbers is needed
         host_verification_words=lengths(strsplit(host_verifications,",")))
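A small sketch of the counting step on a toy vector, to show the difference between `length(list(x))` and `lengths(strsplit(x, ","))`. The `amenities` strings here are made up, assuming the column stores each listing's amenities as one comma-separated string.

```r
# Toy amenities strings, one per listing (hypothetical values)
amenities <- c('["Wifi", "Kitchen", "Washer"]', '["Wifi"]', '["Heating", "TV"]')

# list(amenities) is a list holding a single vector, so its length is always 1
whole_column <- length(list(amenities))

# strsplit() returns one character vector per listing; lengths() counts the pieces in each
counts <- lengths(strsplit(amenities, ","))
counts
```

Note that this counts pieces between commas, so an empty list such as `"[]"` still counts as 1.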
#best model we have so far
model_final<-lm(log_price_4_nights~number_of_reviews+bathrooms_number+bedrooms+review_scores_rating+room_type+host_response_rate+host_identity_verified+availability_90+review_scores_communication+review_scores_value+latitude+longitude+host_verification_words,data=listings_sydney_golden_final)
msummary(model_final)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.668e+02 2.097e+01 -12.724 < 2e-16 ***
number_of_reviews -6.832e-04 1.633e-04 -4.184 3.04e-05 ***
bathrooms_number 1.780e-01 2.486e-02 7.159 1.29e-12 ***
bedrooms 2.707e-01 1.709e-02 15.843 < 2e-16 ***
review_scores_rating 4.607e-01 5.450e-02 8.453 < 2e-16 ***
room_typeHotel room 2.196e-01 9.054e-02 2.425 0.01542 *
room_typePrivate room -6.399e-01 3.000e-02 -21.329 < 2e-16 ***
room_typeShared room -1.509e+00 3.120e-01 -4.837 1.46e-06 ***
host_response_rate -2.296e-03 5.842e-04 -3.930 8.89e-05 ***
host_identity_verifiedTRUE -1.007e-01 4.538e-02 -2.218 0.02670 *
availability_90 -6.670e-04 3.126e-04 -2.134 0.03305 *
review_scores_communication -1.311e-01 4.516e-02 -2.904 0.00374 **
review_scores_value -3.247e-01 4.907e-02 -6.616 5.17e-11 ***
latitude 8.172e-01 1.372e-01 5.956 3.23e-09 ***
longitude 1.989e+00 1.311e-01 15.175 < 2e-16 ***
host_verification_words -1.334e-02 6.282e-03 -2.124 0.03388 *
Residual standard error: 0.4332 on 1446 degrees of freedom
(4649 observations deleted due to missingness)
Multiple R-squared: 0.675, Adjusted R-squared: 0.6716
F-statistic: 200.2 on 15 and 1446 DF, p-value: < 2.2e-16
car::vif(model_final)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.098407 1 1.048049
bathrooms_number 1.991934 1 1.411359
bedrooms 2.261568 1 1.503851
review_scores_rating 4.146465 1 2.036287
room_type 1.295544 3 1.044100
host_response_rate 1.105556 1 1.051454
host_identity_verified 1.208941 1 1.099518
availability_90 1.050237 1 1.024811
review_scores_communication 2.069786 1 1.438675
review_scores_value 3.461040 1 1.860387
latitude 1.093461 1 1.045687
longitude 1.100237 1 1.048922
host_verification_words 1.253325 1 1.119520
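Because the response is log_price_4_nights, the coefficients of model_final are on the log scale. A common way to read them is as percentage changes in price: a one-unit increase in a predictor multiplies the predicted price by exp(beta). The sketch below applies this to three coefficients copied from the summary above; the vector `coefs` is typed in by hand for illustration rather than extracted from the fitted model.

```r
# Selected coefficients copied from the model_final summary (log scale)
coefs <- c(bedrooms = 0.2707,
           bathrooms_number = 0.1780,
           `room_typePrivate room` = -0.6399)

# A one-unit increase changes the predicted price by 100 * (exp(beta) - 1) percent,
# holding the other predictors fixed
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 1)
```

Read this way, each extra bedroom is associated with a price roughly 31% higher, and a private room with a price roughly 47% lower than the baseline room type, other things equal.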
#only contains numeric variables
listings_sydney_golden_final_corr<-listings_sydney_golden_final%>%
select(log_price_4_nights,
number_of_reviews,
bathrooms_number,
bedrooms,
review_scores_rating,
host_response_rate,
availability_90,
review_scores_communication,
review_scores_value,
latitude,
longitude,
host_verification_words)
cor_listings_sydney<- cor(listings_sydney_golden_final_corr, method="pearson")
cor_listings_sydney
log_price_4_nights number_of_reviews
log_price_4_nights 1.000000000 -0.007192563
number_of_reviews -0.007192563 1.000000000
bathrooms_number NA NA
bedrooms NA NA
review_scores_rating NA NA
host_response_rate NA NA
availability_90 0.078100956 0.182522341
review_scores_communication NA NA
review_scores_value NA NA
latitude 0.213186779 0.017569060
longitude 0.265463676 -0.011502634
host_verification_words 0.038405025 0.133125401
bathrooms_number bedrooms review_scores_rating
log_price_4_nights NA NA NA
number_of_reviews NA NA NA
bathrooms_number 1 NA NA
bedrooms NA 1 NA
review_scores_rating NA NA 1
host_response_rate NA NA NA
availability_90 NA NA NA
review_scores_communication NA NA NA
review_scores_value NA NA NA
latitude NA NA NA
longitude NA NA NA
host_verification_words NA NA NA
host_response_rate availability_90
log_price_4_nights NA 0.07810096
number_of_reviews NA 0.18252234
bathrooms_number NA NA
bedrooms NA NA
review_scores_rating NA NA
host_response_rate 1 NA
availability_90 NA 1.00000000
review_scores_communication NA NA
review_scores_value NA NA
latitude NA 0.08897066
longitude NA -0.13253443
host_verification_words NA -0.06565404
review_scores_communication review_scores_value
log_price_4_nights NA NA
number_of_reviews NA NA
bathrooms_number NA NA
bedrooms NA NA
review_scores_rating NA NA
host_response_rate NA NA
availability_90 NA NA
review_scores_communication 1 NA
review_scores_value NA 1
latitude NA NA
longitude NA NA
host_verification_words NA NA
latitude longitude host_verification_words
log_price_4_nights 0.21318678 0.26546368 0.03840502
number_of_reviews 0.01756906 -0.01150263 0.13312540
bathrooms_number NA NA NA
bedrooms NA NA NA
review_scores_rating NA NA NA
host_response_rate NA NA NA
availability_90 0.08897066 -0.13253443 -0.06565404
review_scores_communication NA NA NA
review_scores_value NA NA NA
latitude 1.00000000 0.08090837 0.01474355
longitude 0.08090837 1.00000000 0.10736374
host_verification_words 0.01474355 0.10736374 1.00000000
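The NA entries in the matrix above arise because `cor()` by default returns NA for any pair of columns in which either column has a missing value. The sketch below, on simulated data with invented columns `a`, `b`, and `c`, shows the two usual alternatives via the `use` argument. Which one is appropriate for the listings data is a judgment call that this sketch does not settle.

```r
# Simulated data with a few missing values in column b (illustrative only)
set.seed(4)
d <- data.frame(a = rnorm(100), b = rnorm(100))
d$c <- d$a + rnorm(100, sd = 0.5)
d$b[c(3, 17)] <- NA

cor_default  <- cor(d)                                  # NA for every pair involving b
cor_pairwise <- cor(d, use = "pairwise.complete.obs")   # rows complete for each pair
cor_complete <- cor(d, use = "complete.obs")            # rows complete for all columns
```

With `use = "complete.obs"` every correlation is computed on the same rows, which mirrors how `lm()` drops incomplete observations before fitting.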
corrplot(cor_listings_sydney,method="color",type="lower",tl.cex=1)
corrplot(cor_listings_sydney,method="pie",type="upper",add=TRUE,tl.cex=1,cl.cex=0.5)
#comparing most of the models created; a couple with lower R-squared are removed because only 9 can be displayed
huxreg(model_superhost, model_instant_bookable,model_neighbourhood_cleansed,model_neighbourhood_cleansed2,model_reviews,model_availability,model_reviews_availability,model_final,
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma'),
# bold_signif = 0.05,
stars = NULL
) %>%
set_caption('Comparison of all models')
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5.650 | 5.685 | 5.738 | 5.738 | 5.745 | 5.706 | 5.717 | -266.781 |
| | (0.042) | (0.043) | (0.043) | (0.042) | (0.042) | (0.042) | (0.042) | (20.967) |
| bathrooms_number | 0.138 | 0.138 | 0.141 | 0.141 | 0.140 | 0.142 | 0.138 | 0.178 |
| | (0.017) | (0.017) | (0.016) | (0.016) | (0.016) | (0.016) | (0.016) | (0.025) |
| bedrooms | 0.251 | 0.245 | 0.244 | 0.241 | 0.238 | 0.243 | 0.240 | 0.271 |
| | (0.018) | (0.018) | (0.018) | (0.017) | (0.017) | (0.017) | (0.017) | (0.017) |
| beds | -0.005 | -0.006 | -0.012 | | | | | |
| | (0.011) | (0.011) | (0.010) | | | | | |
| accommodates | 0.041 | 0.044 | 0.050 | 0.044 | 0.045 | 0.042 | 0.043 | |
| | (0.009) | (0.009) | (0.009) | (0.007) | (0.007) | (0.007) | (0.007) | |
| host_is_superhostTRUE | 0.029 | 0.030 | 0.049 | 0.042 | 0.057 | | 0.041 | |
| | (0.021) | (0.021) | (0.020) | (0.019) | (0.020) | | (0.020) | |
| number_of_reviews | -0.000 | -0.000 | -0.000 | | | | | -0.001 |
| | (0.000) | (0.000) | (0.000) | | | | | (0.000) |
| review_scores_rating | 0.023 | 0.021 | 0.015 | 0.015 | 0.016 | 0.018 | 0.019 | 0.461 |
| | (0.008) | (0.008) | (0.008) | (0.008) | (0.008) | (0.007) | (0.007) | (0.055) |
| room_typeHotel room | 0.238 | 0.261 | 0.219 | 0.213 | 0.227 | 0.163 | 0.182 | 0.220 |
| | (0.091) | (0.091) | (0.088) | (0.088) | (0.088) | (0.088) | (0.088) | (0.091) |
| room_typePrivate room | -0.577 | -0.576 | -0.538 | -0.539 | -0.544 | -0.548 | -0.555 | -0.640 |
| | (0.018) | (0.018) | (0.017) | (0.017) | (0.017) | (0.017) | (0.017) | (0.030) |
| room_typeShared room | -0.907 | -0.918 | -0.840 | -0.859 | -0.865 | -0.873 | -0.881 | -1.509 |
| | (0.102) | (0.102) | (0.099) | (0.097) | (0.097) | (0.096) | (0.096) | (0.312) |
| instant_bookableTRUE | | -0.062 | -0.060 | -0.060 | -0.057 | -0.057 | -0.053 | |
| | | (0.015) | (0.015) | (0.015) | (0.015) | (0.015) | (0.015) | |
| neighbourhood_simplifiedEast | | | -0.001 | 0.000 | -0.003 | 0.001 | -0.004 | |
| | | | (0.019) | (0.019) | (0.019) | (0.019) | (0.019) | |
| neighbourhood_simplifiedNorth | | | 0.057 | 0.057 | 0.054 | 0.052 | 0.046 | |
| | | | (0.021) | (0.021) | (0.021) | (0.020) | (0.020) | |
| neighbourhood_simplifiedOther | | | -0.377 | -0.372 | -0.375 | -0.378 | -0.383 | |
| | | | (0.094) | (0.093) | (0.093) | (0.093) | (0.093) | |
| neighbourhood_simplifiedSouth | | | -0.253 | -0.257 | -0.258 | -0.257 | -0.262 | |
| | | | (0.038) | (0.037) | (0.037) | (0.037) | (0.037) | |
| neighbourhood_simplifiedWest | | | -0.284 | -0.283 | -0.284 | -0.289 | -0.293 | |
| | | | (0.020) | (0.020) | (0.020) | (0.020) | (0.020) | |
| reviews_per_month | | | | | -0.013 | | -0.021 | |
| | | | | | (0.005) | | (0.005) | |
| availability_30 | | | | | | 0.004 | 0.005 | |
| | | | | | | (0.001) | (0.001) | |
| host_response_rate | | | | | | | | -0.002 |
| | | | | | | | | (0.001) |
| host_identity_verifiedTRUE | | | | | | | | -0.101 |
| | | | | | | | | (0.045) |
| availability_90 | | | | | | | | -0.001 |
| | | | | | | | | (0.000) |
| review_scores_communication | | | | | | | | -0.131 |
| | | | | | | | | (0.045) |
| review_scores_value | | | | | | | | -0.325 |
| | | | | | | | | (0.049) |
| latitude | | | | | | | | 0.817 |
| | | | | | | | | (0.137) |
| longitude | | | | | | | | 1.989 |
| | | | | | | | | (0.131) |
| host_verification_words | | | | | | | | -0.013 |
| | | | | | | | | (0.006) |
| #observations | 4505 | 4505 | 4505 | 4520 | 4520 | 4520 | 4520 | 1462 |
| R squared | 0.534 | 0.536 | 0.568 | 0.568 | 0.569 | 0.573 | 0.575 | 0.675 |
| Adj. R Squared | 0.533 | 0.534 | 0.566 | 0.567 | 0.567 | 0.572 | 0.574 | 0.672 |
| Residual SE | 0.487 | 0.486 | 0.469 | 0.469 | 0.468 | 0.466 | 0.465 | 0.433 |
Report the point prediction and interval in terms of price_4_nights. If you used a log_price_4_nights model, make sure you anti-log to convert the value to $. You can read more about how to interpret a regression model when some variables are log transformed here
#predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction
#Assumption: we treat "an average rating of at least 90" as host_response_rate >= 90
imaginary_sydney_visit <-listings_sydney_golden_final%>%
select(price_4_nights,
log_price_4_nights,
number_of_reviews,
bathrooms_number,
bedrooms,
review_scores_rating,
room_type,
host_response_rate,
host_identity_verified,
availability_90,
review_scores_communication,
review_scores_value,
latitude,
longitude,
host_verification_words)%>%
drop_na()%>%
filter(number_of_reviews>=10,room_type=="Private room",host_response_rate>=90)
predict_price<-exp(predict(model_final,newdata=imaginary_sydney_visit,interval="prediction",level=0.95))
predict_price
fit lwr upr
1 449.3834 190.95354 1057.5630
2 238.3850 101.58326 559.4169
3 218.7959 93.15520 513.8913
4 379.6554 161.86760 890.4698
5 207.5001 88.44138 486.8342
6 309.3198 131.82413 725.8062
7 278.5258 118.65809 653.7827
8 364.3536 155.03093 856.3038
9 371.9063 158.23457 874.1092
10 290.6832 123.95277 681.6849
11 260.4699 111.05217 610.9252
12 291.1224 124.12306 682.8083
13 322.1001 137.41445 755.0040
14 317.7818 135.45651 745.5179
15 346.0085 147.52747 811.5225
16 265.9014 113.26181 624.2488
17 320.1537 136.25065 752.2780
18 345.1078 146.70395 811.8347
19 310.1358 132.30000 727.0159
20 273.5197 116.75638 640.7615
21 323.6611 137.73972 760.5397
22 256.5209 109.39937 601.4934
23 248.9907 105.82235 585.8531
24 282.1835 120.42517 661.2197
25 281.5734 120.07364 660.2915
26 303.9777 129.30274 714.6210
27 324.9024 138.26833 763.4546
28 323.5216 137.83436 759.3626
29 367.4907 156.64269 862.1497
30 316.5340 134.85651 742.9657
31 493.7067 209.91034 1161.1924
32 310.9323 132.66269 728.7573
33 295.5593 126.10673 692.7093
34 285.4599 121.48080 670.7839
35 258.2559 110.20906 605.1782
36 246.1501 104.95303 577.3048
37 270.3985 115.29229 634.1739
38 313.6007 133.74888 735.2989
39 368.4495 157.00980 864.6278
40 301.8785 128.80125 707.5289
41 263.7587 112.45140 618.6550
42 306.6588 130.49081 720.6609
43 276.6102 118.01031 648.3605
44 261.3231 111.45434 612.7153
45 289.2663 123.11410 679.6538
46 265.3857 113.07266 622.8702
47 393.2855 167.41347 923.9011
48 238.5348 101.66417 559.6744
49 280.1031 119.20356 658.1828
50 233.8934 99.80827 548.1123
51 337.6913 144.06086 791.5782
52 379.7736 161.57230 892.6529
53 332.8633 141.88578 780.8954
54 214.4016 91.40706 502.8937
55 364.3231 155.25666 854.9157
56 255.1349 108.58705 599.4620
57 293.5014 125.07396 688.7372
58 285.8919 121.98991 670.0076
59 276.6956 118.04502 648.5701
60 271.7879 115.91987 637.2389
61 365.5556 155.94996 856.8831
62 369.0501 156.94116 867.8284
63 211.4666 89.64828 498.8173
64 247.1127 105.41422 579.2833
65 304.9493 130.07438 714.9299
66 260.6110 111.19447 610.8046
67 291.6524 124.38286 683.8653
68 255.8654 109.14442 599.8207
69 242.9699 103.66626 569.4656
70 274.4692 117.12659 643.1789
71 613.9757 261.53085 1441.3833
72 272.5264 116.27313 638.7600
73 307.3064 131.02100 720.7792
74 291.9487 124.49596 684.6332
75 360.9555 153.86079 846.7970
76 290.7470 123.99065 681.7758
77 266.1269 113.46934 624.1643
78 283.0621 120.80257 663.2651
79 299.7111 127.89610 702.3417
80 262.3348 111.80421 615.5362
81 255.5116 108.95161 599.2216
82 369.0314 157.10246 866.8495
83 289.6309 123.59415 678.7219
84 242.6256 103.41111 569.2541
85 239.8123 102.19923 562.7238
86 172.8064 73.59378 405.7687
87 243.8222 103.94970 571.9043
88 256.7148 109.54345 601.6105
89 364.3741 154.96059 856.7888
90 309.2369 131.62018 726.5409
91 328.5795 139.82333 772.1491
92 359.7676 152.89654 846.5380
93 462.3034 196.82991 1085.8329
94 320.7557 136.62898 753.0188
95 264.3460 112.60911 620.5430
96 233.0864 99.39541 546.5972
97 266.3452 113.51656 624.9286
98 318.3004 135.67785 746.7330
99 242.1218 103.27957 567.6143
100 279.8776 119.42918 655.8822
101 267.1034 113.88209 626.4744
102 503.5697 213.53893 1187.5234
103 250.4854 106.78958 587.5382
104 346.4389 147.67916 812.7072
105 257.3493 109.77931 603.2889
106 495.3328 209.96929 1168.5259
107 287.3371 122.09211 676.2322
108 298.5832 127.32552 700.1892
109 191.6047 81.35340 451.2703
110 241.8640 102.87311 568.6443
111 241.7849 102.88199 568.2231
112 261.4647 111.36512 613.8708
113 227.8881 97.07188 534.9952
114 355.2555 151.55822 832.7261
115 292.2484 124.58191 685.5662
116 245.8938 104.88233 576.4912
117 196.6511 83.51397 463.0562
118 311.9531 132.46535 734.6431
119 251.5652 107.30264 589.7812
120 243.7719 103.94137 571.7142
121 383.8995 163.57287 900.9980
122 294.4400 125.54833 690.5301
123 394.3184 168.01513 925.4345
124 321.4803 137.02995 754.2118
125 321.0807 136.72878 753.9952
126 235.7241 100.01742 555.5616
127 212.1349 90.51346 497.1769
128 293.9615 125.36839 689.2753
129 209.9588 89.09893 494.7613
130 270.1487 115.14268 633.8252
131 177.1573 75.52670 415.5445
132 263.5663 112.07317 619.8380
133 342.0950 145.93447 801.9283
134 521.3894 221.94142 1224.8590
135 266.7120 113.79787 625.1020
136 270.8087 115.49463 634.9851
137 280.6930 119.69816 658.2268
138 325.8106 138.84693 764.5294
139 336.9293 143.24127 792.5183
140 343.1346 146.13242 805.7169
141 269.5391 114.99516 631.7773
142 297.7647 127.00936 698.0889
143 264.4761 112.89262 619.5943
144 261.7371 111.65848 613.5345
145 152.9919 65.06404 359.7460
146 152.1226 64.39401 359.3702
147 375.3403 159.89698 881.0695
148 362.8704 154.61717 851.6191
149 197.6487 84.30548 463.3743
150 264.1782 112.38599 620.9859
151 300.6597 127.91582 706.6855
152 216.3032 92.23330 507.2686
153 293.0957 125.07614 686.8221
154 387.6676 164.66847 912.6590
155 276.2001 117.85004 647.3185
156 122.3860 51.74257 289.4780
157 139.3276 59.29910 327.3605
158 333.1031 142.07087 781.0025
159 197.1757 83.59241 465.0930
160 114.4187 48.38859 270.5523
161 291.6972 124.41935 683.8746
162 234.0026 99.67380 549.3642
163 383.5588 162.81745 903.5722
164 235.9550 100.50462 553.9524
165 254.1709 108.16630 597.2550
166 232.4590 98.99250 545.8715
167 226.6353 96.28273 533.4658
168 327.9480 139.46563 771.1571
169 249.4668 106.29998 585.4535
170 246.9635 105.31292 579.1405
171 235.9865 100.57067 553.7360
172 359.1531 152.97537 843.2137
173 375.4582 159.92532 881.4668
174 310.3765 132.22937 728.5337
175 237.1377 100.91329 557.2536
176 202.3028 86.26116 474.4477
177 240.0079 102.34822 562.8218
178 288.5397 123.00801 676.8273
179 239.2047 101.97055 561.1316
180 227.9578 97.21187 534.5517
181 248.8424 105.97315 584.3229
182 247.5387 105.39952 581.3632
183 177.4163 75.49162 416.9543
184 211.5530 90.09725 496.7374
185 215.8600 91.93704 506.8199
186 264.7509 112.98220 620.3900
187 141.3648 60.11456 332.4321
188 284.8612 121.42113 668.3016
189 199.5481 85.08138 468.0159
# there is no confidence interval here
price_4_nights. This should be written for an intelligent but non-technical audience. All other sections can include technical writing.
When looking at our best model, the following factors stand out as those that significantly influence price_4_nights. Each had a coefficient of magnitude greater than 0.1, i.e. a change in any of these variables changes the predicted price noticeably:
bathrooms_number, bedrooms, review_scores_rating, all categories in room_type, host_identity_verified, review_scores_communication, review_scores_value, latitude, longitude
From the above, the factors that drive prices in the Sydney Airbnb market are largely what one would expect: location, number of bedrooms, number of bathrooms and room type all influence the cost of a stay. Ratings matter too, since nicer rooms are likely to be both more expensive and better reviewed. Whether a host's identity is verified may also influence price: hosts who take the trouble to get verified are likely to treat their property as a business venture, and so keep nicer properties that command higher prices.
Starting from the initial dataset, we tried to eliminate all repetitive or less significant variables. We first inspected the data with skim and glimpse to understand its overall structure, then examined specific variables with favstats to get more detailed insights and to check that the variables we had selected were sound. After that, we used ggplot to visualise some of the data. We then modified the dataset by dropping non-significant variables and identifying the numerical variables, and examined the correlations between them using corrplot and ggpairs. The correlations show that most variables are not linearly related: outside groups of the same type (the different review scores, for instance), the correlation coefficients are quite low.
Graphs can be found in the EDA section.
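A low linear correlation coefficient does not by itself mean that two variables are unrelated. The following is a minimal sketch in base R on synthetic data (the variables `x` and `y` are made up for illustration and are not columns of `listings`): the Pearson correlation is close to zero even though `y` is almost entirely determined by `x`, because the relationship is curved rather than linear.

```r
# Synthetic illustration only: y depends strongly on x, but not linearly.
set.seed(1)
x <- seq(0, 10, length.out = 201)
y <- (x - 5)^2 + rnorm(length(x), sd = 0.5)

# Pearson correlation measures linear association only.
pearson <- cor(x, y, method = "pearson")

# A model that allows curvature captures the relationship.
quad_fit <- lm(y ~ x + I(x^2))
r2_quad  <- summary(quad_fit)$r.squared

print(c(pearson = pearson, r2_quadratic = r2_quad))
```

This is one reason to look at scatter plots alongside a correlation matrix before ruling a variable out.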
Progressively more variables were added to the models until we reached model_final. This model stood out to us because of its large R-squared and the significance of all its numerical variables. We used the msummary function to validate this model and compared it with our earlier models to check that it was indeed the best. The neibhourhood_simplified factor was removed from our model because its variance inflation factor exceeded the threshold we accept (VIF > 5).
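As a rough illustration of the VIF > 5 rule, the sketch below computes variance inflation factors by hand in base R, using the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The helper `vif_manual` and the predictors `x1`, `x2`, `x3` are hypothetical and use synthetic data; this is not the code used to fit model_final, and it covers numeric predictors only.

```r
# Hypothetical helper: VIF for each column of a data frame of numeric predictors.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    # Regress predictor v on all the other predictors.
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated.
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)

vifs    <- vif_manual(data.frame(x1 = x1, x2 = x2, x3 = x3))
flagged <- names(vifs)[vifs > 5]   # candidates for removal under a VIF > 5 rule

print(round(vifs, 2))
print(flagged)
```

Here the two collinear predictors are flagged and the independent one is not. For a multi-level factor, packages such as car report a generalised VIF instead, so the threshold may need to be read differently.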
In question 1, we discussed which variables had the largest effect on price. We were able to create a model that explained over two thirds of the variation in price. When fitting our model, our aim was to find every factor that had an effect on price and to maximise the share of price variation explained. Our final model used fewer data points than we started with, because some of its variables had missing values. Although we still kept a very large portion of the data, and the model remained statistically significant with the data points we had, we could have improved our analysis by keeping more of the data, even if that slightly reduced the share of price variation explained.
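The loss of data points can be made concrete with a small sketch. It assumes a synthetic data frame `toy` with made-up columns `price`, `bedrooms` and `rating` (not the real listings data) and relies on the default behaviour of `lm`, which silently drops any row with a missing value in a variable the model uses.

```r
# Synthetic stand-in for the listings data; column names are illustrative.
set.seed(7)
n <- 1000
toy <- data.frame(
  bedrooms = sample(1:4, n, replace = TRUE),
  rating   = runif(n, min = 3, max = 5)
)
toy$price <- 100 + 50 * toy$bedrooms + 20 * toy$rating + rnorm(n, sd = 30)

toy$rating[sample(n, 200)]  <- NA   # 200 listings without a rating
toy$bedrooms[sample(n, 50)] <- NA   # 50 listings without a bedroom count

small <- lm(price ~ bedrooms, data = toy)
full  <- lm(price ~ bedrooms + rating, data = toy)

# nobs() reports how many rows each model actually used.
rows_used <- c(total = nrow(toy), small = nobs(small), full = nobs(full))
print(rows_used)
```

Each added variable with missing values shrinks the sample, so R-squared values from models fitted on different subsets of rows are not strictly comparable.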